Modeling The Impact Of AI On Software Development: An Automotive Case Study
Master’s Thesis in Computer science and engineering
Adam Magnus
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

© Adam Magnus, 2025.
Supervisor: Yinan Yu, Department of Computer Science and Engineering
Industrial Supervisor: Dhasarathy Parthasarathy, Volvo Trucks
Examiner: Hans-Martin Heyn, Department of Computer Science and Engineering

Master’s Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

As the trends of integrating artificial intelligence (AI) into software development continue to increase, assessing its impact is crucial, especially in complex, safety-critical domains such as the automotive industry. This study investigates the impact of AI in software development processes through a case study involving two real-world AI-powered solutions at Volvo Trucks: CS-testing and API-testing tools. This study proposes a structured five-phase framework to model the impact of AI from stakeholder-defined perspectives, employing a mixed-methods approach that combines quantitative and qualitative methods. These include interviews, surveys, and various process analyses.
The framework focuses on evaluating factors involving quality, efficiency, automation, and stakeholder alignment to categorize and prioritize metrics and methods.

The evaluation shows that AI-driven tools significantly improve two software testing processes by increasing efficiency and quality, addressing prioritized stakeholder pain points, and preserving existing strengths. Moreover, the solutions deliver measurable value and achieve a high automation level (Level 4), supported by a practical decision tree that helps developers choose suitable automation methods and highlights their direct impact.

Furthermore, this study highlights how AI-driven solutions can facilitate testing workflows and support stakeholders’ decision-making processes. The results showcase a practical methodology for assessing AI’s impact and value in software development, guiding organizations in determining whether to integrate AI solutions into their processes.

Keywords: AI, Impact Assessment, Software Development, Software Engineering, LLM, Software Testing, Process Improvement, Decision-making, Automation

Acknowledgements

I would like to express my heartfelt gratitude to all those who have supported me during the course of this thesis.

First and foremost, I am sincerely grateful to my industrial supervisor at Volvo Trucks, Dhasarathy Parthasarathy, for his guidance, practical insights, and continuous support throughout the project. His real-world perspective played a crucial role in shaping the direction of my work.

I would also like to extend my sincere appreciation to my academic supervisor, Yinan Yu, for her expert advice, thoughtful feedback, and unwavering encouragement. Her mentorship was instrumental in helping me navigate both the challenges and milestones of this research.

I would like to thank my examiner, Hans-Martin Heyn, for his time, thoughtful evaluation, and constructive feedback, which helped enhance the quality and clarity of this thesis.
To my colleagues and friends, thank you for your support, stimulating discussions, and the occasional well-needed distractions. Your presence made this journey far more enjoyable and manageable.

And most importantly, I am deeply indebted to my family for their unconditional love, patience, and belief in me. Their support has been my greatest source of strength throughout this endeavor.

Adam Magnus, Gothenburg, June 2025

Contents

List of Figures
List of Tables
1 Introduction
    1.1 Problem Description
    1.2 Purpose of the Study
    1.3 Research Questions
    1.4 Significance of the Study
2 Background
    2.1 Generative AI
    2.2 AI in Software Engineering
    2.3 Automotive Software Engineering
    2.4 Case Study
        2.4.1 Control System Testing
        2.4.2 API-Testing
        2.4.3 Polymer
    2.5 Impact Dimensions: Business and Technical
3 Related Work
    3.1 Risk evaluation and automation taxonomy
    3.2 Impact on productivity and research gap
    3.3 Economics of Software Engineering
        3.3.1 Return On Software Quality (ROSQ)
    3.4 Software Process Improvement (SPI) and Return on Investment (ROI)
        3.4.1 Capability Maturity Model (CMM) and Capability Maturity Model Integration (CMMI)
        3.4.2 Pareto Analysis
        3.4.3 Quality Improvement Initiatives
    3.5 Project Size Estimation
        3.5.1 Lines of Code
        3.5.2 Function Point Analysis (FPA)
        3.5.3 COSMIC Function Point
    3.6 Value Stream Mapping
4 Methods
    4.1 Research Design and Approach
    4.2 Theory
        4.2.1 Intertwined quality and Efficiency
        4.2.2 Surveys and Interviews
        4.2.3 Factors and Issues Prioritization
        4.2.4 Software Process Improvement Initiatives (SPII)
        4.2.5 Automation Levels and risk
    4.3 Framework
    4.4 Data Collection
5 Results
    5.1 Interview and Surveys
        5.1.1 Round 1
        5.1.2 Round 2
        5.1.3 Round 3, 4, and 5
    5.2 Pareto Analysis
        5.2.1 CS-Testing
        5.2.2 API-Testing
    5.3 Software Process Improvement Initiatives (SPII)
        5.3.1 CS-Testing Tool
        5.3.2 API-Testing Tool
        5.3.3 Gap Analysis
    5.4 Automation Factors
    5.5 Automation Levels and Risk
6 Discussion
    6.1 Addressing Research Questions
        6.1.1 Research Question 1
        6.1.2 Research Question 2
            6.1.2.1 Significance Weighting Metrics
            6.1.2.2 Factor-Specific Metrics
            6.1.2.3 Component-Specific Metrics
            6.1.2.4 Classification Metrics
        6.1.3 Research Question 3
            6.1.3.1 Overview
            6.1.3.2 Tying It All Together
        6.1.4 Research Question 4
            6.1.4.1 Pareto Analysis and Diagrams
            6.1.4.2 SPII
            6.1.4.3 Automation Decision Tree
            6.1.4.4 Tying It All Together
    6.2 Future Works
        6.2.1 Result Compilation and Pre and Post Comparison
        6.2.2 Constraints, Limitations and Risk
        6.2.3 Automatability Levels of the framework
        6.2.4 Longitudinal Studies and Feedback Loops
        6.2.5 Expanding to other domains and focuses
7 Validity Threats and Limitations
    7.1 Internal Validity
        7.1.1 Sampling Bias
        7.1.2 Research Bias
        7.1.3 Social Desirability
    7.2 External Validity
    7.3 Limitations
8 Conclusion
Bibliography
A Appendix
    A.1 Non-utilized Methods
    A.2 Utilized Methods
    A.3 Round 2 interview/survey questions

List of Figures

2.1 Software Development Lifecycle [15][3]
2.2 Domain-centralized E/E system example [19]
2.3 General overview of the CS-Testing Process
2.4 General overview of the API-Testing process
2.5 Manual Process vs AI solution Process [30]
3.1 Levels of risk on the AI-SEAL taxonomy with respect to Points of Application and Levels of Automation [10]
3.2 Levels of automation - DAnTE scale [18]
3.3 Function Points Measurement Model [1]
4.1 Research plan overview.
4.2 Framework workflow overview
5.1 Pareto Chart - CS-Testing quality and efficiency factors
5.2 Pareto Chart - API-Testing quality and efficiency factors
5.3 Main AI components of the CS-Testing tool and the API-Testing tool.
5.4 SPII vs Quality and efficiency factors.
5.5 Decision Tree Visualization

List of Tables

3.1 Function points measurement model
3.2 Function point weights as proposed by Albrecht (1983) [1]
3.3 Example of a function point following the Albrecht 83 version [1]
5.1 Manual CS-Testing Teams and Roles
5.2 Manual API-Testing Teams and Roles
5.3 CS-Testing Teams, Roles and Experience Levels
5.4 Number of factors identified per category - CS-Testing (Raw)
5.5 Number of factors identified per category - API-Testing (Raw)
5.6 Legend for CS-Testing - High Efficiency Factors
5.7 Legend for CS-Testing - High Quality Factors
5.8 Legend for CS-Testing - Low Efficiency Factors
5.9 Legend for CS-Testing - Low Quality Factors
5.10 Legend for API-Testing - High Efficiency Factors
5.11 Legend for API-Testing - High Quality Factors
5.12 Legend for API-Testing - Low Efficiency Factors
5.13 Legend for API-Testing - Low Quality Factors
5.14 Factors considered when automating
A.1 Full non-utilized methods classification
A.2 Full utilized methods classification

1 Introduction

1.1 Problem Description

With the widespread adoption of Artificial Intelligence (AI), its ability to enhance and optimize processes is ever more sought after, especially in the software development industry [27]. However, with the constant evolution of AI and the growing popularity of integrating AI, it is crucial to evaluate its impact critically and determine whether AI offers the best solution needed.
There are numerous forms of AI, with generative AI being one of the most prominent and having the most significant potential. This is due to its capability to generate content such as text and code in an automated fashion, offering solutions in various industries. Integrating generative AI in the software development industry shows great potential to bring value to companies by driving the automation and optimization of multiple processes that traditionally require immense manual effort, time, and money [26]. However, it is critical to understand the actual value and risks of employing AI-driven solutions, especially in highly complex and safety-critical sectors such as the automotive industry.

Current research and industry practice focus primarily on the capabilities of generative AI and on ways to incorporate it within different processes. However, there is still no system for evaluating and modeling the impact of AI within software development processes, and there is a significant lack of critical analysis and evaluation of its necessity, efficiency, and value. This is especially significant in the automotive industry, where unique and critical standards must be followed.

At Volvo Trucks, multiple AI-driven solutions are being developed across many different processes. One of those is the software development process, specifically the testing process. Since testing is arguably one of the most important aspects of safety-critical industries, it is of immense importance to ensure high reliability and accuracy. In this study, two projects were investigated, both utilizing Large Language Models (LLMs) at their core.
Although AI might have a clear and apparent direct impact on the testing process, it remains unclear what the impact is on the broader software development process, as there will be financial and technical impact on many different aspects such as costs, time, employees, and quality. This raises many questions, such as whether AI is the best solution in specific scenarios, how AI fits in the current workflows employed at the company, and whether the gains outweigh the risks. The focus of this study is on AI in software engineering, and it is explored and validated through an in-depth examination of the two software testing processes at Volvo Trucks.

1.2 Purpose of the Study

The main purpose of this study is to develop and validate a structured framework for modeling the impact of generative AI-driven solutions within software development, with a specific focus on manual testing processes in the automotive domain, particularly control system and API testing at Volvo Trucks. This includes identifying what constitutes “impact” in this context and exploring how it can be meaningfully assessed and aligned with stakeholder needs.

This study aims to yield value-based results, which refer to outcomes or findings that are evaluated and interpreted in terms of the value they deliver to stakeholders, rather than just technical performance or isolated metrics. These results and their implications can assist and support practitioners and stakeholders in their decision-making processes, based on the conditions under which AI brings positive net benefits.

In this case, practitioners refer to those involved in hands-on work, such as software developers and AI specialists, who are engaged in the development and implementation of AI tools.
Stakeholders, on the other hand, include individuals or groups that have an interest or investment in these tools, such as end users, decision-makers, or others impacted by their adoption and use, even if they are not directly involved in the development process.

By grounding the research in the testing workflows at Volvo Trucks, the study provides actionable insights for the broader software development industry. This is achieved through methods, frameworks, and roadmaps that support planning, forecasting, and evaluating the value of developing or integrating AI-based solutions into software processes.

1.3 Research Questions

To achieve its purpose, the study will focus on the following research questions:

RQ1: How is "impact" defined in a software development context? How can we model the impact of AI without knowing how the term "impact" is defined? Before we can model it, we need to agree on what we are actually targeting.

RQ2: What impact metrics are applicable and how can these metrics be categorized and prioritized? There are many different metrics available, and not every metric is worth pursuing, so the aim is to focus on importance and relevance.

RQ3: What methodologies can be employed to measure the prioritized impact metrics both quantitatively and qualitatively, and how can these methodologies be applied to practical cases involving generative AI-driven solutions within software development? This question focuses on connecting the theoretical with the tangible, asking not just what works on paper, but what holds up in practice.

RQ4: How can these impact metrics be modeled to address the needs of target stakeholders and support their decision-making process when it comes to the integration of generative AI in software development processes? Metrics alone provide little unless they are matched to real needs.
This question explores how to shape them into models tailored towards specific target stakeholders and how the models can be used to support their decision-making process involving the integration of AI and AI-based solutions in their processes.

1.4 Significance of the Study

This paper will set a foundation for researchers to contribute and further customize models regarding the impact of AI in other contexts and scenarios, both within and outside of software development. This will be achieved by bridging the gap between theoretical frameworks and practical applications. This study will benefit practitioners and software developers by providing methodologies and a road map to evaluate and understand the impact of AI. Furthermore, the findings of this study could allow organizations and decision makers to make informed decisions about adopting AI solutions and optimize their integration into the workflows and processes.

2 Background

2.1 Generative AI

Generative AI is a form of AI designed to create new content, data, or solutions that mimic human-made content. Generative AI models are trained to produce and generate coherent outputs such as text, images, audio, videos, or code, unlike traditional AI models that mainly perform classification and recognition-based tasks [12]. To understand generative AI further, Feuerriegel et al. proposed a three-tier conceptual framework [11]:

1. Model Level
There are different model architectures related to generative modeling, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformer-based models like GPT. In order for these models to recognize and replicate complex patterns in data and to generate relevant and realistic outputs, they are typically trained on massive datasets [11].

2. System Level
Models alone cannot function well independently, as their functionality needs to be embedded "to provide an interface for interaction" [11].
An example of this is seen in Codex’s case [5], where the deep learning model would need to be integrated "into a more interactive and comprehensive system, like GitHub Copilot" [11], allowing a more efficient way for users to code.

3. Application Level
There are many ways to use generative AI in real-world domains, ranging from fields such as software development (e.g., code generation), marketing (e.g., automated content production), customer service (e.g., chatbots), and healthcare (e.g., synthetic data for research). Most of these tasks tend to be regarded as "human-task-technology systems" that use generative AI to "augment human capacities" and whose success usually depends on how well they are integrated into the different workflows and processes, and on whether they are well-trusted and understood by their users [11].

Similarly, Nah et al. highlight the importance of AI-human collaboration, which allows for a higher level of automation by having the AI focus on generating the relevant content under human oversight, as it is still critical to evaluate the quality, contextual relevance, and ethics of the generated output [12].

2.2 AI in Software Engineering

Barenkamp et al. [3] conducted systematic reviews to assess the state of AI, potential developments, and application of AI within the software development life cycle. Their analysis identified multiple uses and advantages of AI, including the acceleration of development processes and workflows as well as the reduction of costs through the automation of time-consuming routine tasks [3].

Figure 2.1: Software Development Lifecycle [15][3]

Below are some of the identified uses of AI within the software development life cycle:

1. Project Planning
In their literature reviews, Barenkamp et al. identified that AI can help with effort and cost estimation by leveraging historical project data and Bayesian models.
AI can also perform risk prediction and management by identifying patterns such as cost overruns, resource issues, and delays based on historical project data through predictive modeling and analysis. Furthermore, AI can help allocate human resources and assign tasks based on the developer’s skills, availability, and performance. More advanced models can also provide dynamic planning that allows reallocation of resources based on unavailability or unforeseen circumstances [3]. Pothukuchi et al. also highlight AI’s ability to assist with requirement gathering, stating that AI can accelerate proposal writing as well as assist with defining the project’s scope. Furthermore, some AI platforms can also increase the efficiency and alleviate the effort of analysts writing "user stories in translating business requirements" [23].

2. Problem Analysis
Similarly to the planning phase, AI can further help with failure point prediction by predicting potential financial, technical, and operational risks within software projects. This is done through machine learning algorithms that can predict probable failure points by finding correlations and patterns in large datasets. Furthermore, discriminant analysis can classify and weigh the different risk factors based on historical project data. AI can also help identify common issues and needs to provide a better understanding of the factors and issues the software should be addressing [3].

3. Software Design
AI can help accelerate design decision-making by providing design templates and recommending architectural patterns based on the given project context through the use of machine learning. Furthermore, AI can help generate content such as UI designs or provide an idea of expected behavior.
AI can also play a role in choice validation regarding different areas within software design, such as database design and component interaction, by providing different alternatives to design and architecture, as well as simulating their performance [3]. Additionally, AI contributes to solidifying the software’s design by analyzing the weaknesses, providing alternative design suggestions, generating user experience mockups, and writing system design documentation [23].

4. Software Implementation
One of the most prominent uses of AI in the software implementation phase is code generation, where AI can generate executable code based on human language descriptions. Furthermore, AI can help with debugging and error detection and assist with completing code and providing suggestions, which can help reduce effort and increase efficiency. Additionally, it can translate code between different programming languages and refactor code [3]. Pothukuchi et al. expand on these insights, stating that AI can help with optimizing code by organizing code structures and using memory efficiently, which in turn results in better resource utilization and performance [23].

5. Software Testing and Integration
AI can help with generating test cases automatically, which in turn saves time and potentially results in higher test coverage. Also, by recognizing patterns and using deep learning, it can detect bugs and common issues in the code. Even though fuzz testing is not based on AI, it can be combined with some AI elements to find vulnerabilities, which is done by injecting random or incorrect inputs into the software [3], allowing for edge case identification [23].

6. Software Maintenance
With some AI tools, it is possible to provide live anomaly detection and generate some code fixes. They can also refactor and clean up code by removing unused, redundant, or duplicate code.
AI can also help with automating updates and release patches based on changes affecting the system, such as security threats and dependency updates. Furthermore, it can solidify security measures by simulating attacks and continuously monitoring the code and configuration [3]. AI can also assist with reporting and monitoring, allowing the analysis of user behavior, providing application owners with custom reports as well as ensuring compliance with policies and standards [23].

To contextualize the research, the focus will be on stage 5, "Testing and Integration," which closely relates to the projects that will be explored for this case study.

2.3 Automotive Software Engineering

The automotive Electrical/Electronic (E/E) system acts as an essential platform that connects and integrates the vehicle’s hardware and software. It connects sensors and actuators with microcontrollers in order to translate "any physical domain into the digital domain" [19]. In most cases, the microcontrollers are the vehicle’s ECUs (Electronic Control Units), and by combining sensors, actuators, and ECUs, one can achieve automated functionality. This is done by forming a distributed and networked system for monitoring and controlling the vehicle’s behaviors through communication networks such as CAN (Controller Area Network) and Ethernet; an example of this can be seen in Figure 2.2. The purpose of sensors is to measure certain physical properties, while actuators are used to execute the different commands based on software decisions. As for ECUs, their main purpose is to process data in the embedded software system to provide real-time decisions. Typically, each ECU is focused on a specific domain, such as the powertrain domain, and in turn controls specific functionality related to the vehicle, such as engine management. This ties directly to the cases examined in this study. An example of a few vehicle ECUs can also be seen in Figure 2.2 [19].
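
The division of roles between sensors, ECUs, and actuators described above can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the cooling-fan scenario, the 95 °C threshold, and the names `CoolingFan`, `ecu_decide`, and `control_step` do not come from the source or from any real vehicle software.

```python
from dataclasses import dataclass


@dataclass
class CoolingFan:
    """Hypothetical actuator: it only executes the command it receives."""
    running: bool = False

    def apply(self, command: bool) -> None:
        self.running = command


def ecu_decide(coolant_temp_c: float, threshold_c: float = 95.0) -> bool:
    """Hypothetical ECU logic: request the fan when the coolant is too hot.

    The 95.0 degree threshold is an illustrative assumption.
    """
    return coolant_temp_c >= threshold_c


def control_step(sensor_reading_c: float, fan: CoolingFan) -> None:
    """One cycle: sensor value in, ECU decision, actuator command out."""
    fan.apply(ecu_decide(sensor_reading_c))


if __name__ == "__main__":
    fan = CoolingFan()
    for reading in (80.0, 101.0, 90.0):
        control_step(reading, fan)
        print(f"coolant={reading} C -> fan running: {fan.running}")
```

In a real vehicle, the sensor reading and the actuator command would travel over a network such as CAN rather than through a direct function call; the sketch only shows how the three responsibilities are separated.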

Figure 2.2: Domain-centralized E/E system example [19]

2.4 Case Study

This section discusses the two cases this research used to evaluate the impact model. It examines the current processes employed by the company, outlines the intended functionality of the proposed AI solution, and provides background information on the model and methodology. Both projects follow the Polymer framework, which is explained later in Section 2.4.3, and are ongoing with varying maturity levels. Furthermore, all the interviews and surveys were conducted with the developers and stakeholders of these projects.

2.4.1 Control System Testing

For the first case, the focus is on a software testing process based on control system testing, henceforth referred to as the Control System Testing process or CS-Testing process. This process is typically carried out by different teams, two of which the solution developers worked closely with. Each team is typically responsible for a specific ECU in the vehicle. The process starts with the team receiving the requirements, and based on those requirements, they trace the required signals and their respective attributes. The next phase involves writing the positive, negative, and boundary cases for the given requirement. The tester then writes the tests and ensures they cover all the required scenarios. Depending on the requirement, the tests require either hardware testing, which uses physical components, or software testing, which can be done virtually. Both are usually executed on a rig, an isolated self-contained execution environment that can be used to simulate vehicle behavior and data. The tests are then executed, and a report is generated showing the results. This process tends to require constant inter-team collaboration and communication.
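
As an illustration of the step in which positive, negative, and boundary cases are written for a requirement, the following is a minimal sketch. It assumes a requirement that constrains a single integer signal to an inclusive range. The signal name `VehicleSpeed`, the range 0 to 250, and the helpers `SignalRange`, `derive_cases`, and `is_valid` are hypothetical and are not taken from the processes or tools at Volvo Trucks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalRange:
    """Hypothetical signal with an inclusive valid range from a requirement."""
    name: str
    minimum: int
    maximum: int


def derive_cases(signal: SignalRange) -> dict[str, list[int]]:
    """Derive positive, negative, and boundary input values for one signal."""
    midpoint = (signal.minimum + signal.maximum) // 2
    return {
        # A value well inside the valid range, which must be accepted.
        "positive": [midpoint],
        # Values just outside the valid range, which must be rejected.
        "negative": [signal.minimum - 1, signal.maximum + 1],
        # Values exactly on the edges of the valid range.
        "boundary": [signal.minimum, signal.maximum],
    }


def is_valid(signal: SignalRange, value: int) -> bool:
    """Reference check that a test would compare system behavior against."""
    return signal.minimum <= value <= signal.maximum


if __name__ == "__main__":
    speed = SignalRange(name="VehicleSpeed", minimum=0, maximum=250)
    for kind, values in derive_cases(speed).items():
        for value in values:
            print(kind, value, "valid" if is_valid(speed, value) else "invalid")
```

Real requirements typically involve several interacting signals and attributes, so this sketch only conveys the idea of the three case categories, not the effort involved in tracing signals or covering all scenarios.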
As there are many teams involved before, during, and after the process, there is high variability in the quality, consistency, and structure of the signals, signal attributes, and even the test cases. This led to the proposal of this project, an AI solution with the goal of automating this process; it is referred to as the CS-Testing tool. During the period of this study, three main developers, with an experience level of 1-2 years in AI engineering, worked on this project alongside a few interns. However, only the solution developers were considered for this study. The main purpose of this tool is to alleviate problems such as difficulty tracing signals due to different naming conventions, lack of coverage, and lack of consistency. Figure 2.3 shows a general overview of the CS-Testing process. Due to confidentiality restrictions, the process diagram is simplified to encompass the general tasks and flow.

Figure 2.3: General overview of the CS-Testing Process

2.4.2 API-Testing

Similarly, this process is a software testing process focused on API testing and will therefore be referred to as the API-Testing process. In this case, the solution developers worked closely with one team. This process starts with receiving requests from another team in the form of APIs and YAML specifications.
Then, a planning phase takes place to understand what is required, which protocols are needed, and whether the CAN signals exist in the databases. Once that is done, the signals from Virtual Vehicle (VV) [30] are mapped to the API properties from the databases. Then, the tests are written, which requires setting CAN signals on the rig and performing some signal tracing. The tests are then executed using API requests and validated against the expected values. If all tests pass, they are passed on to another team; if they fail, the testers must identify the cause and then either fix it or, if the fault lies in the API, communicate with the API developer team. Figure 2.4 shows the general overview of the manual API-testing process. As in the CS-Testing project, the main developers (two in this case), each with 1-2 years of experience in AI engineering, worked on this project with the support of a few interns. Again, only the solution developers were considered for this study. Due to confidentiality restrictions, this process diagram is also simplified to encompass the general tasks and flow.

[Figure 2.4: simplified process diagram. Steps shown: receiving new or updated specifications, planning and understanding the specifications, mapping CAN signals to API parameters, constructing data objects, setting CAN signals through VV, writing test cases (goal: cover all scenarios, valid and invalid CAN signal settings), performing API requests, validating results against the set CAN signals, and nightly runs; testers must trace failures and fix them.]

Figure 2.4: General overview of the API-Testing process

A case study by Wang et al. discusses the proposed AI solution, the SPAPI-Tester, which was developed to automate in-vehicle API testing. Since the project is based on the SPAPI-Tester, it is referred to as the API-testing tool.
It is proposed to help mitigate many issues faced while testing vehicle APIs. For example, the numerous systems involved (e.g., CAN signals and VV simulators) are managed by many different teams, which makes the process complex and tends to result in inconsistent documentation and extensive testing effort and time. Figure 2.5 shows the general manual and automatic steps of both the current process and the proposed AI solution [30].

Figure 2.5: Manual Process vs AI solution Process [30]

2.4.3 Polymer

With the increasing complexity of software development processes, especially at large scale, there is a great need for an automated approach to managing the different workflows. Parthasarathy et al. introduced Polymer [20], a methodology that reimagines software development workflows as programmable entities. By leveraging the power of LLMs, it becomes possible to automate workflows and processes in the earlier phases of the software development lifecycle that were previously either impossible or extremely difficult to automate. These are the earlier stages of the lifecycle mentioned previously and shown in Figure 2.1; Polymer essentially targets all stages rather than only the last stages that are conventionally automated, which ties in with Barenkamp et al.’s research on AI in software engineering. Polymer showed practical performance in real-world efforts at Volvo Trucks, where the LLM-powered processes resulted in a significant reduction in manual effort. One example is the Spapi-tester, described as a "SW-defined workflow that automates the test process", which automated 2-3 full-time equivalents (FTEs) at the expense of two months of development and deployment. Another is the pilot of the Spapi-coder, an implementation workflow estimated to save and automate 15-20 FTEs’ worth of development time.
This research bridges the gap between the software development life cycle and AI, using LLMs as "Skeleton Keys" to overcome the technical and economic challenges of automation within software development [20].

2.5 Impact Dimensions: Business and Technical

Biffl et al. argue that there is no single definition of the term "value" and that it refers to the benefit derived from software, services, or processes. Stakeholders, as individuals and as collectives, may be driven by goals with the "hope to derive some benefit" [2]. These benefits come in many forms and may fall under different categories. Biffl et al. highlight several: "tangible or intangible, economic or social, monetary or utilitarian, or even aesthetic or ethical". They further solidify the definition, stating that value refers to the "ultimate benefit, which is often in the eye of the beholder and admits multiple characterizations", meaning that value is stakeholder-defined [2]. The framework and methodology of this research follow the same approach of stakeholder-defined impact.

3 Related Work

3.1 Risk evaluation and automation taxonomy

A paper by Feldt et al. introduces a taxonomy that categorizes the different applications of AI in software engineering, highlighting the importance of risk assessment [10]. They discuss three main facets (point of application, type of AI, and level of automation) and how the level of risk correlates with the point of application and the automation level: the higher the point of application and the level of automation, the higher the risk. This can be seen in Figure 3.1. The points of application fall into three main categories: process level, product level, and runtime. The automation levels range over 10 levels, with level 1 being completely human decision and level 10 being completely autonomous [10].
They also highlight the importance of using such a taxonomy in companies’ decision-making processes regarding the integration of AI. The taxonomy offers a structured framework for researchers to understand the different ways AI can be used in software engineering and defines terms and levels in order to facilitate communication [10].

Figure 3.1: Levels of risk on the AI-SEAL taxonomy with respect to Points of Application and Levels of Automation [10]

As software engineering has been pushing to reduce the time and effort needed for development and to increase productivity, Melegati and Guerra proposed the DAnTE taxonomy, which defines six degrees of automation for software development tasks. It provides a means to categorize and understand the role of automation, ranging from manual processes to fully autonomous generation [18]. The degrees of automation can be seen in Figure 3.2.

Figure 3.2: Levels of automation - DAnTE scale [18]

3.2 Impact on productivity and research gap

It is evident that AI solutions have an impact on software development. Peng et al. conducted a controlled trial of GitHub Copilot to investigate its effects on developers’ productivity. They asked 95 professional programmers to create an HTTP server in JavaScript and ran a test suite of 12 checks on their GitHub repositories, all of which had to pass for the task to count as completed [22]. Participants were then assessed on task success and task completion time. Results showed that “less experienced developers, developers with heavy coding load, and older developers benefit more from Copilot," and that the group with access to Copilot completed the task 55.8% faster than the control group [22]. This suggests that AI can improve productivity by a substantial margin.
However, they note that the results may vary with other tasks, that more research is required in order to generalize their findings, and that “further investigations into the productivity impacts of AI-powered tools in software development are warranted” [22].

3.3 Economics of Software Engineering

3.3.1 Return On Software Quality (ROSQ)

The cost of software quality can be divided into different types and categories. In a report evaluating the cost of software quality, Slaughter et al. highlight two major types of quality costs: conformance and nonconformance. They define conformance costs as "the amount spent to achieve quality products" [28] and nonconformance costs as "all expenses that are incurred when things go wrong". For each major type, they list the following subtypes [28]:

Conformance
• Prevention Costs: costs incurred in order to "prevent defects before they happen". The examples provided are "costs of training staff in design methodologies, quality improvement meetings, and software design reviews" [28].
• Appraisal Costs: Slaughter et al. state that these costs "include measuring, evaluating, or auditing products to assure conformance to quality standards and performance". Examples include "code inspections, testing and software measurement activities" [28].

Nonconformance
• Internal Failure: all expenses that take place before the "product is shipped to the customer". Examples include "costs of rework, re-inspection and retesting" [28].
• External Failure: Slaughter et al. define external failure costs as "costs that arise from product failure at the customer site". The examples stated are "field service and support, maintenance, liability, damages, and litigation expenses" [28].
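To make the cost breakdown above concrete, the following minimal sketch groups the four cost types into conformance and nonconformance totals. The class name, field names, and all amounts are hypothetical and chosen only for illustration; they are not taken from Slaughter et al. [28].

```python
from dataclasses import dataclass


@dataclass
class QualityCosts:
    """Cost-of-quality breakdown; all amounts share one currency unit."""
    prevention: float        # conformance: training, design reviews, ...
    appraisal: float         # conformance: inspections, testing, measurement
    internal_failure: float  # nonconformance: rework and retesting before shipping
    external_failure: float  # nonconformance: field support, liability, ...

    @property
    def conformance(self) -> float:
        return self.prevention + self.appraisal

    @property
    def nonconformance(self) -> float:
        return self.internal_failure + self.external_failure

    @property
    def total(self) -> float:
        return self.conformance + self.nonconformance


# Hypothetical figures for illustration only.
costs = QualityCosts(prevention=20.0, appraisal=30.0,
                     internal_failure=35.0, external_failure=15.0)
print(costs.conformance, costs.nonconformance, costs.total)
```

Keeping the four subtypes as separate fields makes it possible to track how spending shifts between conformance and nonconformance costs over time.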
Furthermore, Slaughter et al.’s approach to defining and measuring software quality is based on the costs of software failure. The goal is to maximize the profit that can be attained by fixing defects as early as possible within the software’s life cycle, because the later a defect is found in the life cycle, the higher the cost of correcting it. Hence, the cost of software quality is treated as a metric and is used when calculating the Return on Software Quality [28].

Return on Software Quality is a way to measure the financial benefits of investing in software quality. The main idea is that "software quality expenditures must be financially justified" and that "software quality is an investment that should provide a financial return". The authors evaluate expenditure through software quality improvement initiatives, examples of which are "design reviews, testing, debugging tools, code walkthrough, and quality audits". These initiatives should result in software quality revenue (SQR), which is "derived from the projected increases in sales or estimated cost savings due to the software quality improvement". There are two main forms of investment: Software Quality Investments (SQI), the initial costs of "training, tools, efforts, and materials," and Software Quality Maintenance (SQM), the ongoing costs of maintaining quality [28].

3.4 Software Process Improvement (SPI) and Return on Investment (ROI)

The main aim of Software Process Improvement (SPI) is to provide structure and optimization for processes, creating "more effective and efficient software development and maintenance". Van Solingen argues that an organization is likely to produce timely and budget-compliant products if it is well managed and has well-defined processes, specifically engineering processes [29].
When following SPI methods, SPI investments need to be justified, typically in the form of Return on Investment (ROI), similar to the claims of Slaughter et al. [28]. This also allows organizations and managers to prioritize process improvements and allocate resources, maximizing the benefits [29].

3.4.1 Capability Maturity Model (CMM) and Capability Maturity Model Integration (CMMI)

Building on the previously established Capability Maturity Model (CMM), Paulk et al. played a crucial role in the development of version 1.1. They outline CMM as a structured framework for assessing and improving the software development process using five maturity levels: Initial, Repeatable, Defined, Managed, and Optimizing [21]. Each level represents a progression in the organization’s process control and standardization. Processes are initially unpredictable and reactive; toward the Optimizing level, they are continuously improved based on quantitative feedback and ideas. The model is typically used to guide organizations in identifying process deficiencies and implementing improvements in order to enhance software quality and management [21]. Van Solingen also discusses CMM, stating that process improvements need to be tied to measurable outcomes, specifically business value or ROI, rather than aiming for a higher maturity level [29].

Similarly, Gallagher discusses the Capability Maturity Model Integration (CMMI), which builds on CMM, describing it as a way to integrate different maturity models into one framework that extends beyond software engineering into areas such as systems engineering, product and process development, and supplier sourcing. Like CMM, it has five maturity levels: Initial, Managed, Defined, Quantitatively Managed, and Optimizing.
This enables organizations to implement process improvement initiatives across various disciplines [13].

3.4.2 Pareto Analysis

Pareto Analysis emerged from the observation of uneven distributions in economic wealth and operational results. It is known as the 80/20 rule: the idea that 80% of the benefits come from 20% of the efforts. Essentially, the majority of results can be traced back to a minority of inputs. In a business context, this means that a company should focus on the set of efforts, such as products or customers, that lead to the majority of the results, such as revenues and profits. Using this method can help organizations identify internal strengths and weaknesses. Powell and Sammut-Bonnici state that "in many businesses there is a strong tendency to add new products and customers while failing to eliminate those which are obsolete or unprofitable", and those obsolete or unprofitable products and customers may well account for the majority of the costs [24].

3.4.3 Quality Improvement Initiatives

In the BDM International example, Slaughter et al. show how Pareto Analysis was first used to identify the main defects causing most of the problems, and how the fishbone diagram, a cause-and-effect analysis method, was then used to find the root cause of the main problem, in this case the root cause of the JCL errors [28]. With the results of the Pareto analysis and the cause-and-effect analysis, they were able to improve upon the different process improvement initiatives. The main focus was on reducing defects and attempting to eliminate failure costs. They were thus able to trace efforts and results directly to each initiative while focusing on the most impactful problems.
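The Pareto step described in Section 3.4.2 can be sketched as a small ranking routine: causes are sorted by their contribution, and the smallest top-ranked set whose cumulative share reaches a threshold (80% here) is selected. This is a minimal sketch and not the procedure used by Slaughter et al.; the function name `vital_few` is hypothetical, as are all defect counts. Only the "JCL errors" label is borrowed from the example in the text.

```python
def vital_few(counts: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the top-ranked causes whose cumulative share first reaches `threshold`."""
    total = sum(counts.values())
    if total <= 0:
        return []
    selected: list[str] = []
    cumulative = 0.0
    # Walk through the causes from largest to smallest contribution.
    for cause, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(cause)
        cumulative += count
        if cumulative / total >= threshold:
            break
    return selected


# Hypothetical defect counts per cause, for illustration only.
defects = {
    "JCL errors": 60,
    "logic errors": 22,
    "data errors": 10,
    "documentation": 5,
    "other": 3,
}
print(vital_few(defects))  # ['JCL errors', 'logic errors']
```

In this invented data set, two of the five causes account for 82% of the defects, so they would be the ones examined further with a cause-and-effect analysis.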
With each process improvement initiative clearly defined, it is possible to visualize the results and impact, for example by evaluating the process improvement against defect density, ROSQ, or cost of quality [28].

3.5 Project Size Estimation

3.5.1 Lines of Code

One of the simplest and most widely known methods of software project effort estimation is lines of code (LOC), usually expressed in thousands of lines of code (KLOC). It is used in different cost estimation models alongside constants based on factors such as complexity, environment, or practices. Such models include the Walston-Felix model, the Bailey-Basili model, the Boehm models (COCOMO), and the Doty model. One of the main issues with this approach is that there is no universal definition of what counts as a line of code. For example, some consider comment lines to be lines of code [17]. It is also important to note that individual lines of code can vary greatly, especially across programming languages. Another downside is that it is quite difficult to estimate the LOC of a project in the early stages of its life cycle. Furthermore, LOC mainly considers the coding aspect of a project, which, according to Emrick, makes up only 10-15% of the total effort [9][17].

3.5.2 Function Point Analysis (FPA)

Like LOC, FPA is a method for measuring the size of a project. However, it also measures the complexity of the software based on its functionality from the user’s perspective. This not only makes the measurement language independent but also allows the project’s effort to be estimated early in its life cycle.
There have been many officially released versions of FPA, including Albrecht 79, Albrecht 83, and many versions from the International Function Point Users Group (IFPUG), which were established from 1984 onwards to standardize the approach and set specific rules. FPA is measured based on five factors, each assigned a weight derived from its perceived complexity [1].

An empirical study on Function Point Analysis by Abran et al. states that the five factors that make up the unadjusted function points (UFP) use two different measurement processes [1]. The five factors and their respective measurement processes are shown in Table 3.1.

Table 3.1: Function points measurement model

Measurement process                Function types
Data Measurement Process           Internal Logical Files (ILF)
                                   External Interface Files (EIF)
Transaction Measurement Process    External Inputs (EI)
                                   External Outputs (EO)
                                   External Inquiries (EQ)

Even though this may be useful, UFP alone is not enough to obtain a good estimate. To address this, they highlight the importance of the Value Adjustment Factor (VAF), which is used to "assess the environment and processing complexity of the software application as a whole" [1]. The general model for calculating function points is shown in Figure 3.3.

The calculation of VAF is based on 14 predefined general system characteristics (GSC), where each characteristic is assigned a weight based on its predefined definition. VAF is then calculated using this equation, where TotalGSC is the sum of the weights of the 14 characteristics:

\[ \mathrm{VAF} = 0.65 + \mathrm{TotalGSC} \]

Figure 3.3: Function Points Measurement Model [1]

Table 3.2 shows an example of how the weights of each function type are assigned in the Albrecht 83 version, where each function type is given one of three complexity weights: Low, Average, or High [1].
Table 3.2: Function point weights as proposed by Albrecht (1983) [1]

#  Function type              Low  Average  High
1  Internal logical files       7       10    15
2  External interface files     5        7    10
3  External inputs              3        4     6
4  External outputs             4        5     7
5  External inquiries           3        4     6

Table 3.3 shows an example count that uses the measurement model in Figure 3.3 and the Average weights from Table 3.2.

Table 3.3: Example of a function point count following the Albrecht 83 version [1]

Function type               No. of functions  x  Weight  =  UFP
Internal logical files             3          x    10    =   30
External interface files           0          x     7    =    0
External inputs                    2          x     4    =    8
External outputs                   2          x     5    =   10
External inquiries                 5          x     4    =   20
Total UFP                                                =   68

Complexity adjustment factor
GSC 1 to 11     = .00
GSC 12 to 14    = .04 each
Total           = .12
VAF = .65 + .12 = .77

Adjusted Function Points (AFP) = UFP x VAF = 68 x 0.77 ≈ 52

3.5.3 COSMIC Function Point

COSMIC function point (CFP) is derived from traditional function point analysis and is said to be a less complex approach to estimating software project effort. CFPs are better suited to service-oriented, real-time, and embedded software systems. The main difference is that COSMIC does not use complexity weights or value adjustment factors and focuses on data movement rather than functionality from an end-user perspective. It is calculated from data entries, exits, reads, and writes, where each data movement counts as one CFP [6][7]. Furthermore, it can be used on the management and organizational side of software development, where it can function as a metric for scope management, resourcing, productivity, and quality. It also allows the cost per CFP to be tracked, which can help companies set and reach the goal of better "value for money". Moreover, teams can track their effort per sprint using CFP and set CFP-per-sprint goals [7].
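As a worked illustration of the two sizing approaches, the sketch below reproduces the Albrecht 83 count from Table 3.3 (UFP, VAF, and adjusted function points) and counts COSMIC function points as one point per data movement. The function and variable names are hypothetical, the weights are the Average column of Table 3.2, and the list of data movements in the COSMIC example is invented for illustration.

```python
# Average-complexity weights from Table 3.2 (Albrecht 83).
AVERAGE_WEIGHTS = {
    "internal_logical_files": 10,
    "external_interface_files": 7,
    "external_inputs": 4,
    "external_outputs": 5,
    "external_inquiries": 4,
}


def unadjusted_function_points(counts, weights=AVERAGE_WEIGHTS):
    """Sum of (number of functions * weight) over the five function types."""
    return sum(counts[name] * weights[name] for name in counts)


def value_adjustment_factor(gsc_weights):
    """VAF = 0.65 + TotalGSC, with TotalGSC the sum of the 14 GSC weights."""
    return 0.65 + sum(gsc_weights)


def cosmic_function_points(movements):
    """Each entry, exit, read, or write data movement counts as one CFP."""
    allowed = {"entry", "exit", "read", "write"}
    unknown = set(movements) - allowed
    if unknown:
        raise ValueError(f"unknown data movement types: {sorted(unknown)}")
    return len(movements)


# Function counts and GSC weights taken from Table 3.3.
counts = {
    "internal_logical_files": 3,
    "external_interface_files": 0,
    "external_inputs": 2,
    "external_outputs": 2,
    "external_inquiries": 5,
}
gsc = [0.00] * 11 + [0.04] * 3

ufp = unadjusted_function_points(counts)  # 68
vaf = value_adjustment_factor(gsc)        # approximately 0.77
afp = ufp * vaf                           # approximately 52.36
print(ufp, round(vaf, 2), round(afp))

# Hypothetical functional process with four data movements.
print(cosmic_function_points(["entry", "read", "write", "exit"]))
```

The contrast between the two functions mirrors the text: the FPA count depends on complexity weights and an adjustment factor, whereas the COSMIC count is a plain tally of data movements.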
3.6 Value Stream Mapping

Value Stream Mapping (VSM) is a lean management tool used to analyse and optimise the flow of material and information in production and service processes. According to Rother and Shook, VSM enables visual identification of "value-added" and "non-value-added" steps in order to help target waste and improve process efficiency [25]. Langstrand proposes a set of steps as guidance for creating and analysing a Value Stream Map. The first phase is "creating the current state map," which involves drawing up the process, analysing and highlighting the flow of information and materials within it, and adding the relevant data and calculations to the timeline. The second phase is "analyzing the current state map", which involves identifying the bottlenecks within the process, comparing capacity and demand, and exploring the flexibility of the process, with the main focus on waste. Finally, the third phase is "creating the future state map" [16].

Tying this to the automotive industry, Bhamu et al. conducted a case study in the Indian automotive industry on how VSM can enhance production performance by focusing on key lean metrics such as lead time and quality defects. Through mapping and future state planning, they were able to align with demand, follow lean principles, and enable continuous improvement. Their efforts resulted in major improvements, such as reducing lead time by ~20.97%, increasing value by 27%, and improving first-time throughput in almost all processes [4].

4 Methods

4.1 Research Design and Approach

The main research design for this study is a case study focusing on the two projects described earlier. This design was chosen because it allows an in-depth exploration of the real-world context in which AI is being implemented [14].
Moreover, these two projects were highly context-dependent, with direct connections to specific workflows, processes, and organizational structures. An iterative approach was used to build the model and framework for this study. This aligns with Hartley’s work in "Essential Guide to Qualitative Methods in Organizational Research", which highlights the value of case studies in exploring complex organizational processes within their real-life contexts and underscores the flexibility of case study research in allowing multiple data collection methods, including interviews, observations, and document analysis [14].

The iterative approach was carried out by conducting literature reviews to discover which methodologies and metrics exist, breaking them down, and attempting to apply them. Existing methods that did not fit the specific context of this study were either adapted by reusing or modifying relevant components or omitted entirely when they proved incompatible. This approach was taken to evaluate the impact of AI within a software development context and led to the proposal of a structured yet flexible five-phase methodology to guide the modeling of impact by quantitative and qualitative means. The methodology also considers automatability, that is, the extent to which the modeling process itself could be automated.

Creswell highlights the importance of integrating qualitative and quantitative methods to enrich research findings [8]. He also discusses iterative approaches in which data collection and analysis are conducted cyclically in order to increase a study’s validity and reliability, which aligns with my research approach [8].

Figure 4.1 shows an overview of the research design. During the design phase, the research questions were formed based on the research gap and aligned with the company’s goals.
Then, to answer RQ1, literature reviews were conducted, and the findings were shaped by informal conversations with some of the stakeholders. RQ2-4 were answered by first conducting literature reviews and forming the base of the framework from their findings, as well as by engaging in interviews and conversations with the different stakeholders and the developers. Once the base was formed, an iterative process followed. Each iteration started with literature reviews, followed by testing whether the methods could be applied by exploring the available data and by checking them against the stakeholders’ and developers’ interests, based on the previous interviews and their expressed support. The method was then considered established, and data collection started, with interview or survey rounds conducted where possible. Finally, the results were analyzed, and the process was repeated.

[Figure 4.1: flow diagram. Design phase: research gap and company goals lead to forming the research questions. RQ1: literature reviews. RQ2-4: literature reviews and stakeholder and developer engagement lead to forming the base of the framework and roadmap. Iterative part: literature review, data accessibility, and stakeholder interest and support feed into testing the possibility of a method, establishing the method, optional surveys and interviews, and analysing the results.]

Figure 4.1: Research plan overview.

4.2 Theory

4.2.1 Intertwined Quality and Efficiency

When considering quality and efficiency, it can be quite challenging to focus on one and not the other because of the interconnected relationship between them. In most cases, quality is considered the foundation of an organization’s survival. Xu states that "Quality is the guarantee of efficiency" and "Efficiency is the benefit of quality", highlighting that efficiency is useless without quality; the two are complementary rather than mutually exclusive [31].
In this context, efficiency refers to the effective use of time, effort, and resources in software development processes, such as minimizing rework, reducing testing time, or optimizing workflows, while maintaining or improving output quality. Consequently, all efficiency-related and quality-related factors within the different software development processes were placed under the quality category and weighted equally.

4.2.2 Surveys and Interviews

To communicate with stakeholders and developers and to highlight the key factors, the interviews and surveys covered the following main aspects:

Understanding The Process
The main purpose was to understand how the manual processes work and what the stakeholders’ teams do. A further aim was to understand the solution developers’ efforts, perspectives, and what they do or plan to do. This information was then used to break down and categorise the different workflows in the processes. This was mainly done through semi-structured interviews.

Quality and Efficiency Factors
The main focus here was on identifying the key factors and sources contributing to both high and low quality, as well as high and low efficiency, within the workflows and processes used by the stakeholder team. This was done through surveys and/or interviews, with informal conversations added to clarify certain factors where needed.

Prioritization using $100 method
The focus here was on prioritizing and ranking the different factors for each area (high/low efficiency and high/low quality) using the $100 method, in which each person splits $100 across the factors in each category. This was done purely through surveys.

Mapping Factors and Gap Analysis
The focus of this round of interviews was on mapping the identified factors to the main components of the AI solution.
Moreover, the interviews aimed to identify which factors were not targeted and to understand why.

Automation Factors
The main focus was on the solution developers, with the goal of understanding which factors they considered when deciding whether to use traditional scripting or LLMs to automate specific steps in the process.

4.2.3 Factors and Issues Prioritization

Prioritization followed the Pareto method, or 80/20 rule [24], under which, in this case, 80% of the problems can be traced to 20% of the causes. This was the main means of prioritizing which quality issues most needed to be mitigated and which quality factors were the most important to keep or improve. It helped provide simple but clear goals and targets for both designing and improving the AI solution, as well as a way to evaluate the efforts to mitigate or improve those factors.

4.2.4 Software Process Improvement Initiatives (SPII)

By extracting the different project components and treating them as process improvement initiatives, following a similar path to Slaughter et al. [28], it is possible to directly evaluate such initiatives, features, or functionalities against different tangible outcomes, such as time, cost, quality factors, or process-specific metrics.

4.2.5 Automation Levels and Risk

By exploring and understanding the projects and how they fit into the current manual processes, it was possible to use the DAnTE taxonomy to estimate the automation level of the projects [18]. Moreover, Feldt et al.’s risk taxonomy made it possible to understand and highlight the general risk level of applying such automation levels at the specific points of application [10].
The degree of automation from the DAnTE taxonomy was used here instead of Feldt et al.’s taxonomy because, based on the descriptions, it was more suitable for the projects and more concentrated on LLMs, which relates strongly to the AI technology used in those projects [18].

4.3 Framework

This section provides a general overview of the five phases and a description of each of their steps. Each phase and step is also given a subjective three-level automatability rating (Low, Medium, or High), based on how automatable it is, including data collection and analysis:

1. Understanding the current process (Medium)
This phase sets the baseline and enables understanding of how the current process operates without any external intervention.
   a. Stakeholder engagement (Medium)
   Engage with developers, managers, and product owners to gain a better understanding of the current processes, workflows, and how everything fits together, and to collect insights about pain points, team structure, and communication patterns.
   b. Solution developer insights [if under development] (Medium)
   If there is a solution currently under development, gather feedback from those developing the AI solution to gain a different perspective on the processes.
   c. Workflow breakdown (Low)
   Based on the gathered information, break down the processes and workflows into distinct steps so that each step can be identified separately.

2. Identifying and prioritizing the problem (High)
This phase focuses on discovering pain points and factors that relate to the quality of the workflow.
   a. Stakeholder insights (Medium)
   Engage with developers, managers, and product owners to gain a better understanding of the problems and pain points in the current processes and workflows. Collect qualitative insights to understand how current issues affect developer productivity and software delivery.
   b. Identifying key factors (High)
   Extract the sources or factors that lead to high and low quality and efficiency in the current workflows and processes.
   c. Data collection (High)
   Collect data related to the factors from the previous step and use analytical tools or prioritization methods to prioritize the most prominent factors.

3. Breaking down the AI solution (Medium)
This phase breaks down the AI solution in a structured manner in order to understand and/or predict the effects on the current workflow.
   a. Solution developer insights (Medium)
   Engage with the developers of the AI solution to gain a better understanding of the solution’s features and development plan.
   b. Solution breakdown (main components) (Medium)
   Identify the core components of the AI solution.
   c. Assessing constraints and limitations / risk (Medium)
   Highlight the constraints and limitations of the data accessibility and the AI solution, and understand the risks tied to the different automation levels and usage of AI.

4. Aligning the solution to the problem (Low)
Map the features of the AI solution to the problems identified previously to evaluate the direct impact of the AI solution on the newly adjusted process.
   a. Mapping of prioritized factors and solution components (Medium)
   Establish connections between the prioritized factors and the different identified components of the AI solutions.
   b. Gap analysis (Low)
   Compare the proposed and the current process and identify whether or not the AI solution has mitigated the identified issues. Furthermore, understand the reasoning behind any factors not being targeted.

5. Analyzing the impact (Medium)
Evaluate the impact of implementing and integrating the AI solution with the current process to form the new process.
   a. Assessing outcomes (High)
   Collect and analyze the quantitative and qualitative data regarding the quality and efficiency factors from surveys and interviews.
   b. Result compilation [if developed or mature enough] (High)
   Assess the results of the solution regarding its expected outcome.
   c. Pre and Post comparison [if applicable] (Low)
   Compare relevant metrics and/or KPIs (Key Performance Indicators) before and after AI integration to assess causal impact.

Figure 4.2 shows the general overview of the different phases, the steps within them, as well as the general flow.

[Figure 4.2: Framework workflow overview. The diagram shows the five phases in sequence, each containing the steps listed above.]

4.4 Data Collection

To understand the processes and utilize the proposed methods and metrics, five main rounds of surveys and/or interviews were conducted. These five rounds involved different stakeholders and solution developers. The stakeholders’ roles ranged across managers, product owners, and software testers (stakeholder teams), while the solution developers were the developers currently developing or previously contributing to the development efforts of the AI solution. Using interviews as one of the main data collection methods enabled the collection of deep, qualitative insights about the different stakeholder perceptions regarding the current processes and goals of the solution. In contrast, surveys were used to collect quantitative data in a simple manner, covering a broader perspective and typically complementing the interviews, which allowed for the uncovering of different patterns and themes.
To achieve this, all the components of the survey allowed for free text input.

To reach the stakeholders for interviews and surveys, I was initially referred by the project developers to five stakeholders involved with the CS testing tool and two stakeholders associated with the API testing tool. Since these individuals were selected based on their relevance and involvement in the respective tools, this is considered purposive sampling. These initial stakeholders then either nominated additional participants for the other rounds of interviews and surveys or provided contact lists from within their teams, following a snowball sampling method. As for the developers, all individuals involved in the development of both tools were contacted directly, which follows a census sampling approach, as the intent was to include the entire relevant developer population.

Round 1

In order to understand the processes and stakeholders’ goals, semi-structured interviews were conducted with various stakeholders from the different teams involved in CS-Testing and API-Testing, as well as with solution developers involved in the development of the CS-Testing tool and the API-Testing tool. Informal conversations also took place when needed, specifically with the solution developers, to better understand the different processes and solutions, as they tended to have a detailed overview and understanding of the processes currently being employed and the projects being developed.

Round 2

In order to gather the different sources and factors related to high/low quality and efficiency within the different teams’ workflows and processes, structured interviews and surveys were conducted, with the addition of follow-up conversations for clarification purposes. The main questions asked in the interviews and surveys can be seen in Appendix A.3.
Round 3

The factors deduced from the previous round of surveys and interviews were analyzed, cleaned up, and summarized into a maximum of 10 bullet points per question for the first four questions. As this round’s focus was on using the $100 method to prioritize the efficiency and quality factors, surveys were the most suitable instrument. The round targeted members from the stakeholder teams as well as the solution developers, since the latter had extensive knowledge and understanding of the different processes used by the stakeholder teams. The $100 method asks participants to distribute a hypothetical $100 across a set of items to reflect their relative importance, highlighting the main factors. This helped with identifying the key factors by having respondents weight items within the following four main categories:

• High Efficiency
• Low Efficiency
• High Quality
• Low Quality

Round 4

This round consisted only of interviews, as its main focus was on mapping the factors to their respective components from the AI solutions. It targeted the solution developers of each of the tools. The interviews followed an open-ended format to allow participants the flexibility to express their thoughts freely.

Round 5

Since this round’s focus was on understanding which factors the solution developers consider when choosing the automation technique, either surveys or interviews could have been used. However, semi-structured interviews were the most suitable, as they allowed for capturing depth and exploring the larger context of those factors. Furthermore, this allowed for a better understanding of how solution developers think.

5 Results

5.1 Interview and Surveys

5.1.1 Round 1

CS-Testing

During this round, interviews with five members from the two stakeholder teams involved in CS-Testing were conducted.
Their roles comprised managers, specialists, and central members within the processes, which made it possible to capture the crucial steps and workflows and to understand and explore the processes. Furthermore, discussions regarding their goals and needs stemmed from recurring issues and frustrations, highlighting the importance of quality and efficiency within their processes. Table 5.1 shows the interviewees’ roles and which team they are a part of.

Table 5.1: Manual CS-Testing Teams and Roles

| Team | Role |
|---|---|
| 1 | Group Manager, Manage Electrical Engineering |
| 1 | Senior ESW Application Engineer |
| 2 | Specialist ESW Application Engineer |
| 2 | Experienced ESW Application Engineer |
| 2 | Specialist System Verification Engineer |

ESW: Embedded Software

API-Testing

For this process, interviews with two members from the stakeholder team were conducted. The roles interviewed were a manager and a crucial member within the testing workflow. This also allowed the essentials of the process to be captured. Similarly to CS-Testing, their focus was also on quality and efficiency. Table 5.2 shows the interviewees’ roles.

Table 5.2: Manual API-Testing Teams and Roles

| Role |
|---|
| Senior ESW Application Engineer |
| Experienced ESW Application Engineer |

ESW: Embedded Software

5.1.2 Round 2

For this round, one project collected data through surveys while the other project collected data through interviews. Although the data collection methods differed between the projects, they are still comparable because both the interviews and surveys focused on the same five questions, which are presented in Section 4.4.

CS-Testing

For CS-Testing, the stakeholders preferred to participate through surveys instead of interviews. The survey was sent to thirty members, and seven responded; their roles comprised managers, testers, product owners, and engineers. Table 5.3 shows the different roles and experience levels of the respondents as well as their teams.
The number of responses per category can be seen in Table 5.4, while Tables 5.6–5.9 show the cleaned and finalized factors.

Table 5.3: CS-Testing Teams, Roles and Experience Levels

| Team | Role | Experience (years) |
|---|---|---|
| 1 | Product Owner | 4+ |
| 1 | Senior ESW Application Engineer | 3–4 |
| 1 | Component Level Testing | 4+ |
| 1 | Associate Engineer | 2–3 |
| 2 | System Verification Engineer | 2–3 |
| 2 | ESW Application Engineer | 1–2 |
| 2 | Experienced SW Verification Engineer | 3–4 |

ESW: Embedded Software; SW: Software

Table 5.4: Number of factors identified per category - CS-Testing (Raw)

| Category | Number of Factors |
|---|---|
| High Efficiency | 7 |
| Low Efficiency | 17 |
| High Quality | 13 |
| Low Quality | 15 |

API-Testing

For the API-testing case, the stakeholders preferred to participate in interviews rather than surveys. The interviews were conducted with two out of fifteen members from the stakeholder team, holding Experienced Embedded Software (ESW) Application Engineer and testing roles with 1 to 2 years of experience in component-level testing. The number of extracted factors per category can be seen in Table 5.5, while the finalized and cleaned factors can be seen in Tables 5.10–5.13.

Table 5.5: Number of factors identified per category - API-Testing (Raw)

| Category | Number of Factors |
|---|---|
| High Efficiency | 8 |
| Low Efficiency | 16 |
| High Quality | 5 |
| Low Quality | 8 |

5.1.3 Round 3, 4, and 5

Round 3 - CS-Testing

A total of 3 surveys took place, including 2 testers with 2 years of experience in testing from the stakeholder teams and 1 solution developer.

Round 3 - API-Testing

A total of 4 surveys took place, including 2 testers with 1 and 2 years of experience from the stakeholder team and 2 solution developers.

Round 4

A solution developer from each of the two projects was interviewed.

Round 5

A total of six solution developers were interviewed, and several common factors were identified across them.

5.2 Pareto Analysis

The results of the round 3 surveys were collected and analyzed using the Pareto analysis method.
5.2.1 CS-Testing

[Figure 5.1: Pareto Chart - CS-Testing quality and efficiency factors. Panels: (a) CS-Testing High Efficiency, (b) CS-Testing High Quality, (c) CS-Testing Low Efficiency, (d) CS-Testing Low Quality. Each panel plots the Count per factor and the Cumulative % curve, with the top 20% of factors highlighted and an 80% threshold line.]

Table 5.6: Legend for CS-Testing - High Efficiency Factors

| ID | Factor |
|---|---|
| I | Ability to write better test cases faster. |
| II | Prioritizing tasks and using time-blocking techniques. |
| III | Utilization of [Internal Documentation Tools] for functional understanding, workflow analysis, and dependency mapping. |
| IV | Workflow design and mapping to identify bottlenecks and create standard operating procedures (SOPs). |
| V | [Internal Documentation Tool] collaboration view and verification tools aid in test case creation and automation. |
| VI | Availability of test rigs. |
| VII | Clear scope definition for [Specific ECU] component level SW Release and Regression. |

Table 5.7: Legend for CS-Testing - High Quality Factors

| ID | Factor |
|---|---|
| I | Attaching test cases to requirements in [Internal Documentation Tool] for coverage tracking and documentation. |
| II | Ensuring the final product meets stakeholder needs. |
| III | Maintaining work and reporting areas in [Internal Documentation Tool] for tracking verification reports and traceability. |
| IV | Benchmarking SW Release dates allows sufficient verification time. |
| V | Comprehensive testing addressing all project aspects. |
| VI | Performing component-level regression with all available test cases for validation. |
| VII | Comparing results with previous releases for analysis. |
| VIII | Maintaining clear and detailed records of processes and changes. |
| IX | Robustness and ability to handle unexpected issues. |
| X | Regularly checking outputs against requirements and standards. |

Table 5.8: Legend for CS-Testing - Low Efficiency Factors

| ID | Factor |
|---|---|
| I | Excessive time spent writing test cases. |
| II | Difficulty maintaining test cases with updated requirements. |
| III | Rewriting the same test cases for different vehicle modes. |
| IV | Time spent correcting mistakes or redoing work due to lack of clarity/errors. |
| V | Writing repetitive test cases for similar requirements with minor variations. |
| VI | Difficulty retrieving necessary information. |
| VII | Need for improvement in Requirements Traceability. |
| VIII | Dependency on tools for generating test case execution files (e.g., XML). |
| IX | Frequent unavailability or issues with validation rigs (e.g., rigs, digital twin errors (VV)). |
| X | Slow progress due to limited resources or capacity bottlenecks. |

Table 5.9: Legend for CS-Testing - Low Quality Factors

| ID | Factor |
|---|---|
| I | Poor traceability hindering the tracking of requirements and changes. |
| II | Team errors due to fatigue, misunderstanding, or lack of training. |
| III | Neglecting to review processes or outputs. |
| IV | Inconsistent reports from regression due to inconsistent test cases or tool timing issues. |
| V | Failure to account for unusual scenarios. |
| VI | Inadequate quality of edge case test cases. |
| VII | Missing edge case testing leading to post-release issues. |
| VIII | Missing corner case test cases leading to incomplete test coverage. |
| IX | Inadequate testing coverage leaving critical areas untested. |
| X | Variations in process quality reducing reliability. |

The charts in Figure 5.1 show the Pareto analysis conducted on each of the factors deduced from Round 2 using the results from Round 3.
For each of the Pareto analysis charts, a corresponding legend can be found in Tables 5.6–5.9. With this approach, the top 20% of factors, which in most cases are factors I and II, usually have the biggest contribution to the problems experienced in the processes. However, the results do not strictly follow the 80/20 Pareto principle, which can be due to two main reasons. The first is that the surveys had a very low participation rate. The second is that the factors and issues faced may be more systematic and widespread, without a clear focus on specific factors, especially considering that different stakeholders’ perspectives do not align. For Figure 5.1a, the first two factors contribute ~60%, while the top two factors in both Figure 5.1b and Figure 5.1c contribute ~50% of the total within their respective areas. For Figure 5.1d, the top two factors contribute ~45%.

Based on the results, the top factors contributing to high efficiency (Figure 5.1a) are "The ability to write better test cases faster" (I) and "Prioritizing tasks and using time-blocking techniques" (II). The top factors contributing to high quality (Figure 5.1b) are "Attaching test cases to requirements in [Internal documentation tool] for coverage tracking and documentation" (I) and "Ensuring the final product meets stakeholder needs" (II). These are the factors that the AI solution must maintain and keep.

The top factors contributing to low efficiency (Figure 5.1c) are "Excessive time spent writing test cases" (I) and "Difficulty maintaining test cases with updated requirements" (II). The top factors contributing to low quality (Figure 5.1d) are "Poor traceability hindering the tracking of requirements and changes" (I) and "Team errors due to fatigue, misunderstanding, or lack of training" (II). These are the factors that the AI solution must mitigate or fix.
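To make the prioritization step concrete, the following is a minimal Python sketch of how $100-method responses could be aggregated into the counts and cumulative percentages plotted in a Pareto chart. It is an illustration only, not the analysis script used in this study: the function names `pareto_table` and `top_share`, the rounding-up rule for the size of the top 20% group, and the example responses are all assumptions.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, float]  # (factor, total dollars, cumulative %)


def pareto_table(responses: List[Dict[str, float]]) -> List[Row]:
    """Aggregate $100 allocations and return rows sorted by descending total."""
    totals: Dict[str, float] = {}
    for response in responses:
        for factor, dollars in response.items():
            totals[factor] = totals.get(factor, 0.0) + dollars

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    rows: List[Row] = []
    running = 0.0
    # Sort by total (largest first); ties are broken by factor label.
    for factor, total in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
        running += total
        rows.append((factor, total, 100.0 * running / grand_total))
    return rows


def top_share(rows: List[Row], top_percent: int = 20) -> float:
    """Cumulative % held by the top `top_percent` of factors (count rounded up)."""
    if not rows:
        return 0.0
    k = max(1, (len(rows) * top_percent + 99) // 100)  # integer ceiling
    return rows[k - 1][2]


# Hypothetical responses: each respondent distributes $100 over factors I-V.
responses = [
    {"I": 40, "II": 30, "III": 20, "IV": 10, "V": 0},
    {"I": 50, "II": 10, "III": 20, "IV": 10, "V": 10},
    {"I": 30, "II": 30, "III": 10, "IV": 20, "V": 10},
]

rows = pareto_table(responses)
for factor, total, cumulative in rows:
    print(f"{factor:>3}  ${total:5.0f}  {cumulative:5.1f}%")
print(f"Top 20% of factors hold {top_share(rows):.1f}% of the weight")
```

With these hypothetical responses, the top 20% of factors (one factor out of five) hold 40% of the allocated weight, well short of the 80% that a strict 80/20 distribution would imply. Comparing `top_share` against 80 is one simple way to check how closely a set of responses follows the Pareto principle.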
5.2.2 API-Testing

[Figure 5.2: Pareto Chart - API-Testing quality and efficiency factors. Panels: (a) API-Testing High Efficiency, (b) API-Testing High Quality, (c) API-Testing Low Efficiency, (d) API-Testing Low Quality. Each panel plots the Count per factor and the Cumulative % curve, with the top 20% of factors highlighted and an 80% threshold line.]

Table 5.10: Legend for API-Testing - High Efficiency Factors

| ID | Factor |
|---|---|
| I | Clear acceptance criteria |
| II | Speed of bug detection |
| III | Developer-tester feedback loop |
| IV | CI scheduling (Nightly regression) |
| V | Helper functions for test case writing |
| VI | Planning phase for tests |
| VII | Optimized tests |

Table 5.11: Legend for API-Testing - High Quality Factors

| ID | Factor |
|---|---|
| I | Stable runs on regression (non-flaky tests) |
| II | Robustness of test cases |
| III | Fail test pointers – probable causes |
| IV | Uniformity of structure and framework |
| V | Report generation for validation |

Table 5.12: Legend for API-Testing - Low Efficiency Factors

| ID | Factor |
|---|---|
| I | Dependencies on specific roles or information from other teams, causing bottlenecks. |
| II | Manual data mapping and integration for test case writing. |
| III | Difficulty and time wasted due to missing documentation and inconsistent naming conventions. |
| IV | Challenges in backtracking and identifying the root cause of failures in test cases involving multiple API objects. |
| V | Impact of varying tester experience and understanding on efficiency. |
| VI | Time-consuming process of writing comprehensive edge and invalid test cases. |
| VII | Steep learning curve and complexity in understanding and setting up environments (e.g., Rig setup, Virtual Vehicles). |
| VIII | Inefficiencies and delays due to manual rig scheduling and low rig availability. |
| IX | Time spent rewriting/modifying and rerunning test cases after API updates. |
| X | Time spent fixing and redoing tasks due to human errors. |

Table 5.13: Legend for API-Testing - Low Quality Factors

| ID | Factor |
|---|---|
| I | Missing documentation impacting quality |
| II | Incomplete coverage of edge cases and invalid test cases |
| III | Low test coverage |
| IV | Mapping according to requirements |
| V | Lack of a dedicated debug function (tester dependent debugging) |
| VI | Not testing error codes (e.g., 404) |
| VII | Testers’ time management impacting quality |

As with CS-Testing, Figure 5.2 shows the Pareto analysis charts for API-Testing, with the corresponding legends in Tables 5.10–5.13. In this case, the charts show a similar trend of not adhering to the Pareto principle. Again, misaligned stakeholder perspectives can be a reason, as can the very small number of survey participants. For Figure 5.2a and Figure 5.2d, the first two factors contribute ~60%, while the top 20% of factors in both Figure 5.2b and Figure 5.2c contribute ~50% of the total within their respective areas.

Based on the results, the top factors contributing to high efficiency (Figure 5.2a) are "Clear acceptance criteria" (I) and "Speed of bug detection" (II). The top factors contributing to high quality (Figure 5.2b) are "Stable runs on regression (non-flaky tests)" (I) and "Robustness of test cases" (II). These are the factors that the AI solution must maintain and keep.

The top factors contributing to low efficiency (Figure 5.2c) are "Dependencies on specific roles or information from other teams causing bottlenecks" (I) and "Manual data mapping and integration for test case writing" (II).
The top factors contributing to low quality (Figure 5.2d) are "Missing documentation" (I) and "Incomplete coverage of edge cases and invalid test cases" (II). These are the factors that the AI solution must mitigate or fix.

5.3 Software Process Improvement Initiatives (SPII)

Due to confidentiality, the actual components will not be displayed. Instead, they are referred to by substitute aliases of the form "Component #". As an example of what a component is, a system might contain a major part that generates code using AI, which would be labeled "Code generator". Figure 5.3 shows a basic idea of the major components of both projects without stating the actual components.

[Figure 5.3: Main AI components of the CS-Testing tool and the API-Testing tool. Boxes shown: Internal Documentation tools and DB, Internal Tool, and Components 1 to 5.]

The primary purpose of breaking the system into different components is to enable mapping and visualizing the efforts involved in developing the components and their impact on the various quality and efficiency factors identified through surveys. Furthermore, the factors are categorised into those that the components fix and those that the components keep or improve. This was done during Round 4 of interviews, where the solution developers of each of the tools were interviewed with the focus of mapping each of the factors to their respective components. The mappings and the contributions of the components to the factors are displayed in Figure 5.4.

5.3.1 CS-Testing Tool

Factors Fixed

Based on Figure 5.4a, for the Efficiency factors, the most impactful component is Component 1, fixing a total of six issues faced by the stakeholders within the manual CS-Testing process. For the Quality factors, Component 3 fixed the majority of the issues, targeting a total of five factors. Overall, this tool targeted 20/20 of the factors that needed fixing.
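The mapping of factors to components described above can be sketched as a small Python computation. This is an illustrative sketch under stated assumptions, not the procedure used in the study: the function name `component_coverage`, the factor labels, and the mapping itself are hypothetical, since the real components and their mappings are confidential.

```python
from typing import Dict, Set, Tuple


def component_coverage(
    mapping: Dict[str, Set[str]], factors: Set[str]
) -> Tuple[Dict[str, int], Set[str], Set[str]]:
    """Return (factors per component, targeted factors, untargeted factors)."""
    per_component = {name: len(fs & factors) for name, fs in mapping.items()}
    # A factor addressed by several components is counted once overall.
    targeted = set().union(*mapping.values()) & factors
    return per_component, targeted, factors - targeted


# Hypothetical factor labels (LE = Low Efficiency, LQ = Low Quality).
factors_to_fix = {"LE-I", "LE-II", "LE-III", "LQ-I", "LQ-II"}

# Hypothetical mapping of components to the factors they address.
mapping = {
    "Component 1": {"LE-I", "LE-II"},
    "Component 2": {"LE-II", "LQ-I"},
    "Component 3": {"LQ-II"},
}

per_component, targeted, gap = component_coverage(mapping, factors_to_fix)
print(per_component)
print(f"Targeted {len(targeted)}/{len(factors_to_fix)}; gap: {sorted(gap)}")
```

In this sketch, the per-component counts correspond to the bars of an SPII chart, the size of `targeted` corresponds to a summary such as "20/20", and `gap` lists the factors that a follow-up gap analysis would need to explain. Counting a shared factor only once in the overall total is an assumption of the sketch.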
Factors Kept/Improved

Based on Figure 5.4b, for the Efficiency factors, Component 1 targeted the most factors, with a total of 4 factors. For the Quality factors, Component 5 targeted the most factors, with a total of 6 factors. Overall, however, the CS-Testing tool missed one of the Efficiency factors, meaning that the tool targeted only 16 out of 17 factors that needed to be kept.

[Figure 5.4: SPII vs Quality and efficiency factors. Panels: (a) CS-Testing Fixed Factors, (b) CS-Testing Kept Factors, (c) API-Testing Fixed Factors, (d) API-Testing Kept Factors. Each panel shows the cumulative factor count per component (Component 1 to Component 5), split into Efficiency and Quality.]

5.3.2 API-Testing Tool

Factors Fixed

Figure 5.4c shows that Component 2 contributed the most to fixing the Efficiency factors, fixing a total of 4 factors. For the Quality factors, Component 1 contributed the most, fixing 6 factors. The API-Testing tool was able to target all the factors that needed to be fixed, totaling 17/17 factors.

Factors Kept/Improved

Figure 5.4d shows that for the Efficiency factors, Component 1 contributed the most by keeping or improving three factors. For the Quality factors, Component 2 and Component 4 contributed equally, with a total of 2 factors each. However, similar to the CS-Testing tool, it did not target every factor, missing two factors, both related to Efficiency. Overall, the API-Testing tool was able to target 10 out of 12 factors in this category.

5.3.3 Gap Analysis

Although the factors may be traced directly to the sources (Components), SPII does not highlight the reasoning behind the mapping, and further exploration may be needed in the form of a gap analysis. In this case, this was done within Round 4 of the interviews, where the solution developers were able to explain why certain factors were not targeted by the tools. These factors were not targeted because they were either restricted by company policies or controlled by another internal third party. Moreover, the factors that were valued the highest in the Pareto analysis have all been targeted by their respective AI solutions.

5.4 Automation Factors

Interviews conducted with the solution developers showed near-unanimous agreement regarding which factors are considered when deciding to use programmable LLM modules or traditional scripting as the main method of automation for the different steps of the process. Furthermore, some of the factors were backed by previous research based on Polymer, as explained by Parthasarathy et al. [20]. Those factors were then categorised and defined based on the interview responses. The factors can be seen in Table 5.14 alongside the possible options and guiding questions.

Table 5.14: Factors considered when automating

| Category | Factor | Description | Options |
|---|---|---|---|
| Formatting | Context understanding | Does the step require interpreting meaning beyond explicit inputs? | Yes, No |
| Formatting | Semi-structured language support | Does this step involve inputs that mix both structured elements (e.g., code, configuration) and unstructured components (e.g., natural language descriptions)? | Unstructured, Structured, Semi-structured, Human |
| Formatting | Formalization Required | Does this step require converting these specifications into machine-executable formats? | Yes, No |
| Formatting | Structured data input | Is the data input for this step consistently structured and predictable? | Yes, No |
| Formatting | Replicable patterns/boilerplates | Can this step be standardized using predefined templates or reusable patterns? | Yes, No |
| Formatting | Dynamic interaction required | Does this step require adaptive exchanges based on changing inputs or interactions with constantly updating sources? | Yes, No |
| Discriminative Activities | Context understanding | Does this step depend on understanding broader context for accurate classification? | Yes, No |
| Discriminative Activities | Judgement | Does this step involve subjective assessments that might require human judgement? | Yes, No, Human |
| Discriminative Activities | Reasoning/Decision-making | Does this step require logical analysis or context-based decision making? | Yes, No, Human |
| Generative Activities | Context understanding | Does this step rely on capturing and maintaining context for coherent outputs? | Yes, No |
| Generative Activities | Deterministic outcomes | Is the output of this step expected to be predictable and identical for the same set of inputs every time? | Yes, No |

[Figure 5.5: Decision Tree Visualization. Nodes, in order: Context Understanding, Judgement Required, Reasoning / Decision Making Required, Semi-Structured Language Support, Formalization Required, Structured Data Input, Replicable Patterns / Boilerplates, Dynamic Interaction Required, Deterministic Outcomes. Outcomes: LLM Automation, Manual Requirement, Traditional Scripting Automation.]

Figure 5.5 is based on the interviews with the solution developers in order to understand a part of their thought process when deciding on the automation method to use.
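The factors in Table 5.14 and the decision tree in Figure 5.5 can be sketched as a short Python function. This is a hedged illustration rather than the developers' actual procedure: the node order and branch outcomes encode one plausible reading of Figure 5.5 (an assumption), and the names `Step` and `choose_method`, together with the default answers, are hypothetical.

```python
from dataclasses import dataclass

LLM = "LLM automation"
MANUAL = "Manual requirement (human)"
SCRIPT = "Traditional scripting automation"


@dataclass
class Step:
    """Answers to the guiding questions for one process step (assumed encoding)."""
    context_understanding: bool = False
    judgement: str = "no"            # "yes", "no", or "human"
    reasoning: str = "no"            # "yes", "no", or "human"
    language: str = "structured"     # "structured", "semi-structured", "unstructured", "human"
    formalization_required: bool = False
    structured_input: bool = True
    replicable_patterns: bool = True
    dynamic_interaction: bool = False
    deterministic_outcomes: bool = True


def choose_method(step: Step) -> str:
    """Walk the nodes in the assumed order and return the first decisive outcome."""
    if step.context_understanding:
        return LLM
    for answer in (step.judgement, step.reasoning):
        if answer == "human":
            return MANUAL
        if answer == "yes":
            return LLM
    if step.language == "human":
        return MANUAL
    if step.language in ("semi-structured", "unstructured"):
        return LLM
    if step.formalization_required or not step.structured_input:
        return LLM
    if not step.replicable_patterns or step.dynamic_interaction:
        return LLM
    if not step.deterministic_outcomes:
        return LLM
    return SCRIPT


# A fully structured, deterministic, template-friendly step.
print(choose_method(Step()))
# A step that needs meaning beyond its explicit inputs.
print(choose_method(Step(context_understanding=True)))
# A step whose judgement must stay with a person.
print(choose_method(Step(judgement="human")))
```

Because each check returns as soon as a decisive answer is found, reordering the `if` statements is enough to model a developer who prioritizes the factors differently.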
Beyond helping developers decide whether to choose traditional scripting, AI, or a human in the loop, this decision tree can help highlight the direct value of using AI at a lower level, where feasibility and effort are affected by the specific step. Furthermore, it highlights the direct impact of using AI instead of traditional automation, differentiating between the two. A decision tree was used in this case because it generalizes well and can be reused for different cases and scenarios. The decision tree is also flexible: the order of the nodes can be rearranged when a developer prioritizes the factors differently based on preference, requirements, and capability.

The tree is traversed starting from the first node, context understanding. This node relates to the three main categories, Formatting (Purple), Discriminative Activities (Blue), and Generative Activities (Teal), each of which has a different description; hence, it is marked with the three different colours. The relevant path is then followed. In this case, if context understanding is needed in any of the three mentioned categories, the path follows "Yes", meaning that LLM automation is needed and highlighting that context understanding is a benefit of using AI for that specific step.

5.5 Automation Levels and Risk

Having considered the different levels of automation from the DAnTE taxonomy, the level of automation of both projects corresponds to Level 4, "Global Generator" [18]. This is because both tools automate workflows, including the generation of code, test suites, and documentation, with the inclusion of humans who review, approve, and may correct/refine the generated output.
The perceived benefit of such automation, according to their research, is increased productivity, reduction of manual efforts, and reduction of errors [18], which relates to the factors and issues mentioned by the different stakeholders in the interviews and surveys.

Following Feldt et al.’s taxonomy [10], the point of application varies depending on the scale of what is considered the "application". Looking at the tools from the perspective of the whole process, the point of application would be the process level, which signifies lower risk and less negative impact. However, when looking only at the projects, the point of application of AI would be the product level, which would signify higher risk levels. Additionally, considering the automation level of 4 on the DAnTE scale [18], even higher risk levels and negative impacts may be present according to Feldt et al.’s research [10]. Hence, safeguards may need to be considered.

6 Discussion

6.1 Addressing Research Questions

6.1.1 Research Question 1

RQ1: How is "impact" defined in a software development context?

Based on literature reviews and interviews with multiple stakeholders, the term “impact” is defined as:

"The qualitative and quantitative changes and/or consequences across processes and outcomes within the software development lifecycle. It constitutes the alteration or creation of value in its different stakeholder-defined forms, as a direct or indirect result of the integration and use of AI as a software development solution."

This definition highlights that impact encompasses not only measurable data but also descriptive data. Furthermore, it highlights that impact can come in different forms, such as direct and indirect. Moreover, having conducted interviews with the stakeholders, a specific technical and business "impact" focus is possible based on the available data and the restrictions that are in place.
6.1.2 Research Question 2

RQ2: What impact metrics are applicable and how can these metrics be categorized and prioritized?

Answering this question requires literature reviews, which make it possible to explore different metrics by investigating different methodologies. Having conducted interviews to understand what the stakeholders want, it is also possible to find the methodologies and metrics that align with their needs. An iterative approach of exploring the methods, testing them, and adjusting accordingly makes it possible to categorize the different metrics, rule out those that are not feasible, and identify the aspects most relevant to the stakeholders. The studied cases are considerably limited in terms of product maturity and company data accessibility. Hence, a structured approach that considers the constraints of the study and the dimensions of the software processes is required.

The metrics are categorized into four main categories: Significance Weighting Metrics, Factor-Specific Metrics, Component-Specific Metrics, and Classification Metrics. Each plays a major role within its respective method and, more importantly, showcases how the AI solutions impact the software development processes. The order in which these metrics are stated follows their prioritization. Starting with the Significance Weighting Metrics, this approach enables the prioritization of different factors based on stakeholder-perceived significance, further highlighting the strengths and weaknesses of the processes and facilitating the use of Factor-Specific Metrics. Factor-Specific Metrics offer a way to quantify the breadth of the factors related to the different quality categories (high/low quality and efficiency), as well as to quantify those that are addressed by the AI solution.
Similarly, Component-Specific Metrics allow traceability to the specific components of the AI solution, showcasing the contribution of the individual components and expanding the evaluation of the AI solution’s effectiveness. Finally, the Classification Metrics provide a quantitative approach to the qualitative attributes and information regarding how automatic the AI solution is.

6.1.2.1 Significance Weighting Metrics

$100 Weights
Represents the relative importance of each factor as assigned by participants using the $100 prioritization method. It is used to identify which factors are perceived as most critical to quality and efficiency. Participants distribute $100 across a set of factors within each category to reflect their perceived significance.

6.1.2.2 Factor-Specific Metrics

Total Number of Factors
Highlights the number of factors around high/low quality and efficiency, which allows the user to gauge the maximum number of factors that can be targeted.

Number of Factors Per Quality Category
Highlights the number of factors within a specific quality category (i.e., High Quality, Low Quality, High Efficiency, Low Efficiency), which allows the user to gauge the maximum number of factors that can be targeted within that quality category.

Total Number of Targeted Factors
Highlights the number of factors the AI solution targets, essentially eliminating the negative factors, or keeping and/or improving the positive factors around high/low quality and efficiency.

Number of Targeted Factors Per Quality Category
Highlights the number of factors the AI solution targets within a specific quality category (i.e., High Quality, Low Quality, High Efficiency, Low Efficiency), essentially eliminating the negative factors, or keeping and/or improving the positive factors within that quality category.

Total Factor Coverage
Using the above-mentioned factor metrics, it is possible to calculate the percentage of factors targeted by the AI solution. This is denoted as:

\[
\text{Factor Coverage} = \frac{\text{Total Number of Targeted Factors}}{\text{Total Number of Factors}} \times 100\%
\]

Factor Coverage Per Quality Category
Using the above-mentioned factor metrics, it is possible to calculate the percentage of factors targeted by the AI solution for each quality category. This is denoted as:

\[
\text{Factor Coverage Per Quality Category} = \frac{\text{Number of Targeted Factors Per Quality Category}}{\text{Number of Factors Per Quality Category}} \times 100\%
\]

6.1.2.3 Component-Specific Metrics

Component Contribution
With each targeted factor mapped to its respective component, it is possible to calculate the percentage contribution of each component to the total number of targeted factors. This is defined as:

\[
\text{Component Contribution} = \frac{\text{Number of Factors Targeted by Component}_i}{\text{Total Number of Targeted Factors}} \times 100\%
\]

6.1.2.4 Classification Metrics

Automation level
By using predefined taxonomies and classification frameworks such as DAnTE [18], an automation-level classification can be derived. The main purpose of this metric is to provide a quantifiable value for qualitative descriptions of software automation. This is done either by understanding the capability of the tools or by discussing the different automation level descriptions with the solution developers to reach a consensus. This directly connects with the understanding of direct impact, where automation is seen as a positive outcome. Moreover, this automation level value clarifies the role of both the human and the AI within the software development context.

6.1.3 Research Question 3

RQ3: What methodologies can be employed to measure the prioritized impact metrics both quantitatively and qualitatively, and how can these methodologies be applied to practical cases involving generative AI-driven solutions within software development?

In other circumstances, methods related to costs, effort, and time, or even results pertaining to the study cases, would be employed here to answer the research question. However, that is not the case here due to the company restrictions and the stakeholder focus. A table giving a general overview of the context, description, strengths, and weaknesses can be seen in Appendix A.1 for the unutilized methods and in Appendix A.2 for the utilized methods. To answer this research question, a combined qualitative and quantitative approach is taken, focusing more on stakeholder-defined priorities and process analysis.

6.1.3.1 Overview

Interviews and Surveys
Interviews and surveys serve as the main tools for gathering data, allowing the capture of both quantitative and qualitative data. This helped design the structure of the research, as well as understand the different processes and direct the aim towards quality and efficiency, and the different factors related to them.

Pareto Analysis and $100 Method
Allowing stakeholders to allocate a fictional $100 across the different factors made it possible to quantify the importance of the factors and to conduct a Pareto analysis. Here, the $100 weights are used to show the individual perceived significance. Pareto Analysis highlights the important factors causing most of the process issues as well as the ones that actually help the process, from both the quality and efficiency perspectives. It also helps draw conclusions from the survey and interview data, providing both quantitative and qualitative results by offering a prioritized list of factors where each carries its own qualitative value while also contributing quantitative data by showing their relative significance in terms of frequency and impact.
Conducting a Pareto Analysis allows the usage of the factor-specific metrics to highlight which areas should be prioritized for AI intervention and which components of the solution align most closely with the critical issues or strengths identified.

SPII
Software Process Improvement Initiatives provide a mix of quantitative and qualitative value. They allow the major parts of the AI solutions to be decomposed into specific components, so that each component can be traced and mapped as a source of impact. In this case, the impact is the specific factors the solution highlights or the issues it fixes or mitigates. In other cases, mapping the components to cost, time, effort, and other feasible metrics would be possible. Furthermore, this enables the usage of the component-specific metrics, highlighting the relative contribution of each component to the overall effectiveness of the AI solution and enabling a more detailed assessment of which parts deliver the most value in addressing specific process inefficiencies or quality improvements. This can help highlight which components deliver the most value and justify development and investments.

Automation Taxonomy and Risk
This helps classify the level of autonomy, where automation in itself is considered an impact, as well as the point of application, which together give developers and stakeholders a high-level view of the perceived risk that comes with automation. By following the DAnTE framework [18], it is possible to use the automation level classification metric, providing a quantitative value for how automatic the software is and qualitative information clarifying the human and AI roles. Combined with the AI-SEAL [10] risk diagram shown in Figure 3.1, it can provide qualitative information related to the perceived risk of automation.
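As an illustration of how the factor-specific and component-specific metrics can be derived from a factor-to-component mapping of the kind used in SPII, the following minimal Python sketch computes factor coverage and component contribution. The factor names, the component names, and the simplification that each factor is targeted by at most one component are assumptions made for illustration only and are not taken from the studied cases.

```python
from collections import Counter

# Hypothetical mapping from stakeholder-reported factors to the solution
# component that targets them; None marks an untargeted factor.
# All names are illustrative and not taken from the studied cases.
FACTOR_TO_COMPONENT = {
    "manual test scripting": "test generator",
    "ambiguous specifications": "specification parser",
    "slow result reporting": "test generator",
    "hardware availability": None,
}


def factor_coverage(mapping):
    """Return the percentage of factors targeted by the solution."""
    if not mapping:
        raise ValueError("at least one factor is required")
    targeted = sum(1 for component in mapping.values() if component is not None)
    return 100.0 * targeted / len(mapping)


def component_contribution(mapping):
    """Return each component's percentage share of the targeted factors."""
    counts = Counter(c for c in mapping.values() if c is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {component: 100.0 * n / total for component, n in counts.items()}


coverage = factor_coverage(FACTOR_TO_COMPONENT)
contribution = component_contribution(FACTOR_TO_COMPONENT)
print(f"Factor coverage: {coverage:.1f}%")
for component, share in contribution.items():
    print(f"{component}: {share:.1f}% of targeted factors")
```

Restricting the mapping to the factors of a single quality category before calling `factor_coverage` would give the per-category variant of the metric. A factor addressed by several components would require a different data structure, such as a list of components per factor.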
6.1.3.2 Tying It All Together

As a collective, these methodologies answer RQ3 by combining prioritization, structured process analysis, component-level traceability, and a range of metrics to achieve both quantitative and qualitative evaluation of the impact of AI in real-world software development processes. Together, they showcase which methodologies can be employed to measure the impact metrics both quantitatively and qualitatively, and how they can be applied to practical cases involving generative AI solutions within software development contexts.

6.1.4 Research Question 4

RQ4: How can these impact metrics be modeled to address the needs of target stakeholders and support their decision-making process when it comes to the integration of generative AI in software development processes?

To support stakeholder decision-making regarding the integration of AI in software development, this study introduces a structured, yet flexible, approach to using and exploring impact metrics and methods rather than following abstract and generic ones. The model is built from the bottom up using literature reviews, process-specific considerations, and practitioner insights, in both a quantitative and qualitative way. The model follows the idea that impact does not have a universal meaning but is defined by stakeholders, allowing the usage of context-dependent factors instead of abstract terms like "effort saved".

The proposed five-phase modelling framework is a structured model that guides practitioners in analyzing, interpreting, measuring, and modelling impact. This framework can be seen in Section 4.3. The first phase allows one to gain the base knowledge of how the processes work and the needs of the stakeholders, highlighting the focus point of modeling the impact of AI.
The second phase focuses on understanding the issues that the stakeholders face within their current processes, in this case leading to the gathering and analysis of the high and low quality and efficiency factors. The third phase focuses on breaking down the proposed AI solution by communicating with the solution developers to understand how the solution works and to identify its main components. Once that is done, the fourth phase focuses on aligning the solution and the problem by mapping the identified factors to the main components, as well as extracting the untargeted factors and understanding the reasons behind them. Finally, the fifth phase focuses on analyzing the impact by assessing the outcomes and evaluating the results of the solution in relation to its expected outcome, considering whether the solution is developed or mature enough to achieve it. Moreover, during this phase, the old and the new processes (with the solution) are compared through relevant metrics and KPIs. In this case, the main outcomes of applying the framework are the Pareto Analysis, SPII, and Automation Decision Tree.

6.1.4.1 Pareto Analysis and Diagrams

Conducting a Pareto Analysis makes it possible to understand which issues and factors are the most prominent from the perspective of the stakeholders. This allows the stakeholders and solution developers to direct their focus towards the factors that are prioritized higher. Furthermore, it follows the idea that the top 20% of the causes relate to 80% of the problem. While the share may not be exactly 80%, it is still a significant percentage of the problems. Moreover, in this study, conducting a Pareto Analysis results in the creation of four Pareto Charts per process, targeting the High Efficiency, High Quality, Low Efficiency, and Low Quality factors respectively, which simplifies and clarifies the results.
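The way $100 allocations feed a Pareto Analysis can be sketched in a few lines of Python. The sketch below sums the participants' allocations per factor, ranks the factors, and accumulates their percentage shares; the factor names, the allocation values, and the 80% cut-off parameter are hypothetical and chosen only to illustrate the calculation, not taken from the survey data of this study.

```python
def pareto_table(allocations):
    """Rank factors by total allocated dollars.

    allocations: list of dicts (one per participant) mapping a factor
    name to the dollars that participant assigned to it.
    Returns a list of (factor, share_percent, cumulative_percent).
    """
    totals = {}
    for allocation in allocations:
        for factor, dollars in allocation.items():
            totals[factor] = totals.get(factor, 0) + dollars
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no dollars were allocated")
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    rows, cumulative = [], 0.0
    for factor, dollars in ranked:
        share = 100.0 * dollars / grand_total
        cumulative += share
        rows.append((factor, share, cumulative))
    return rows


def prioritized_factors(rows, threshold=80.0):
    """Return the top-ranked factors needed to reach the cumulative threshold."""
    selected = []
    for factor, _share, cumulative in rows:
        selected.append(factor)
        if cumulative >= threshold:
            break
    return selected


# Hypothetical allocations from two participants for one quality category.
example = [
    {"manual scripting": 50, "unclear requirements": 30, "tool setup": 15, "reporting": 5},
    {"manual scripting": 40, "unclear requirements": 40, "tool setup": 10, "reporting": 10},
]
rows = pareto_table(example)
for factor, share, cumulative in rows:
    print(f"{factor:22s} {share:5.1f}% {cumulative:6.1f}%")
print(prioritized_factors(rows))
```

Running the same functions once per quality category would yield the data behind the four Pareto Charts per process. The threshold is kept as a parameter because, as noted above, the split need not be exactly 80%.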
6.1.4.2 SPII

Breaking down the solution into major components allows for the use of SPII, where each component can be considered a process improvement, as the components may work as standalone solutions that provide a positive impact on the process. With the Pareto Analysis done previously, the factors can be mapped to the different solution components, resulting in diagrams like the ones shown in Figure 5.4. Furthermore, it is possible to model other metrics with SPII, such as cost, effort, or quality-specific metrics like defect density, as shown in the research conducted by Slaughter et al. [28]. This can further guide stakeholders in decision-making about an AI-based solution with regard to stakeholder focuses that were not captured in this study.

6.1.4.3 Automation Decision Tree

This captures the direct impact and intent of using AI at a lower level by considering developers’ reasoning and decision-making when choosing traditional scripting versus AI, as well as providing the qualitative value of AI. Furthermore, it can help developers and stakeholders decide when it is appropriate to use AI or traditional scripting, or even to do the process manually. This is done through a flexible decision tree, shown in Figure 5.5, as well as the supporting questions in Table 5.14.

6.1.4.4 Tying It All Together

This study demonstrates how different metrics and methods can be combined to form a model that can be contextualized and operationalized to support stakeholder decision-making by applying the five-phase modeling framework with support from Pareto Analysis, SPII, and the Automation Decision Tree. This approach avoids reliance on abstract or generic measures and instead tailors the model towards process-specific issues and the AI solution components, providing traceability and ensuring relevance as well as navigating around organizational restrictions.
This also allows stakeholders to identify the situations in which AI brings value and can have a net benefit, justify investments, and make structured, evidence-based decisions. Linking this study to the broader software engineering lifecycle, the results from the case studies can be generalized by focusing on the adaptability of the five-phase modeling framework, which is not limited to specific tools or domains. Its emphasis on process-specific metrics, stakeholder alignment, and decision support makes it applicable across various software development contexts where generative AI is being considered. While the specific findings are context-dependent, the underlying approach offers a flexible structure that can be tailored to different teams, workflows, and organizational settings within software engineering. Therefore, RQ4 is addressed by providing a structured yet flexible model for aligning generative AI solutions with real-world software development contexts and needs.

6.2 Future Works

6.2.1 Result Compilation and Pre and Post Comparison

Within the final phase of the proposed framework, the result compilation and pre-post comparisons are labeled as "if applicable." This is because certain requirements pertaining to the maturity and development of the tool need to be fulfilled in order to conduct them, which was not the case in the studied cases. Further research needs to be conducted to investigate and explore the causal impact of the AI tools in practice. This would involve the systematic collection of pre- and post-implementation data around KPIs and the actual results of the tool, such as time, software quality improvements, and, in this case, testing-specific metrics. Data collection may take place both prior to the integration of the tool and shortly after its implementation, as well as once stakeholders have become more familiar and comfortable with the new process.
6.2.2 Constraints, Limitations and Risk

The framework contains a step in the third phase for assessing the constraints, limitations, and risks of implementing AI, which requires a more detailed assessment. Constraints related to data availability, model explainability, organizational readiness, and compliance (especially in safety-critical domains) need to be addressed and evaluated more formally.

6.2.3 Automatability Levels of the framework

The current use of low, medium, and high automatability levels for classifying the automation level of the framework steps and phases is subjective and provides only a qualitative estimate. Future research should focus on operationalizing the categories using established or well-defined, measurable criteria. This would standardize the assessment across different contexts and make it feasible to model automation.

6.2.4 Longitudinal Studies and Feedback Loops

To expand on this research, a longitudinal study would enable capturing the long-term impact of AI solutions. This can be done by tracking the adoption, performance, and stakeholders’ feedback over a period of time. This will not only help refine the framework and the modeling process but may also identify other emerging values, risks, and consequences not captured in this study.

6.2.5 Expanding to other domains and focuses

As this study mainly focuses on, and is validated through, testing processes within an automotive context, specifically control system testing and API testing at Volvo Trucks, future research can explore other domains, aspects, and organizations of varying sizes to capture different perspectives and further validate or adapt the framework, enabling the discovery of new findings.
7 Validity Threats and Limitations

7.1 Internal Validity

7.1.1 Sampling Bias

Low Sample Size
One of the biggest internal validity threats is the low sample size; in some cases, response rates were very low. For example, in Round 3 of interviews and surveys, only 2 out of 30 people responded. This may introduce nonresponse bias, as the results may have been completely different if more people had responded, and it may affect the reliability and representativeness of the data. The low participation may have been influenced by resistance to change, skepticism toward AI integration, or a lack of perceived value in the study among potential respondents. It may also have resulted from practical constraints, such as limited availability or a lack of motivation to engage with interviews or surveys.

Low Sample Diversity
The intention was to interview and sample different roles within the stakeholder team in each round. However, this was not achieved: in Round 3, only testers responded to the surveys, although the surveys and interview requests were sent to roles ranging from product owners to managers. This may introduce bias or skew the results in favor of the testers’ perspective, which may be more technically and operationally oriented than business or managerially oriented, and may therefore affect the representativeness of the different roles.

7.1.2 Research Bias

There is a risk of research bias because some of the solution developers participated in the surveys and interviews while also helping define the framework through their responses, leading to the creation of the automatability decision tree. This may skew the outcome of the research, as the participants may unconsciously have had an interest in, or alignment with, the success of the AI solutions. It also blurs the line between data collection and design.

7.1.3 Social Desirability

There is also a possibility of social desirability bias in how stakeholders responded to questions about inefficiencies in manual processes. In some cases, participants may have underreported challenges or inefficiencies to avoid reflecting negatively on colleagues or team practices. Additionally, given the study’s focus on AI-driven solutions, there may have been a fear of job displacement, leading some stakeholders to consciously or unconsciously withhold or downplay information that could support automation. This can limit the depth and honesty of feedback, particularly in areas where AI could be perceived as a threat to existing roles.

7.2 External Validity

Since this study mainly focuses on one company and the specific domain of automotive software testing, the generalizability of the findings may be limited. Even though the study attempts to create a flexible and generalizable framework, the foundations of the framework originate from internal processes and tools at that company, possibly making the findings context-specific and requiring additional steps to make the framework more adaptable. Moreover, the AI solutions are still under development, and with a lack of end-user feedback, certain conclusions on impact are partially speculative.

7.3 Limitations

Due to the period during which the research was conducted in relation to the maturity and state of the tools, it was not possible to perform the longitudinal measures related to the pre/post-comparison and the result compilation from the last steps in phase 5 of the framework. Moreover, it would have been possible to collect some project-specific metrics. However, these would not have provided much valuable information because the tools were constantly being improved and undergoing architectural changes, and because there were many restrictions and limitations on collecting stakeholder data that could be used for comparison.
Furthermore, due to constraints regarding company data, certain planned metrics collection activities were not feasible, such as those related to cost; even without those restrictions, collecting such metrics would still have been a complicated and non-trivial process. Hence, methods such as the Return on Software Quality were not pursued.

8 Conclusion

With a focus on software development processes, specifically software testing, in the automotive industry at Volvo Trucks, this study aimed to propose and explore a structured, stakeholder-centric framework for evaluating the impact of AI-driven solutions. By incorporating qualitative and quantitative methods and techniques such as Pareto Analysis, Software Process Improvement Initiatives, and Automation Decision Trees alongside interviews and surveys, this research provides a methodology to model the impact of AI on software development processes.

The development of this five-phase framework enables practitioners to understand existing processes and workflows, identify and prioritize problems, comprehend the proposed AI-based solution, and map the solution to the problem, allowing for the analysis and measurement of the impact of AI within these processes. Furthermore, this study defines the term "impact", emphasizing that there is no universal definition and that the term is stakeholder-centric. This research ensures not only theoretical relevance but also practical usefulness through the application and investigation of real-world industrial cases.

With a focus on identifying and prioritizing quality and efficiency factors, aligning them with the main components of the AI tool, and assessing the external limitations, the results show that a structured AI solution aligned with stakeholder needs can deliver value through measurable improvements, alleviating the major pain points while preserving the strengths of the current processes.
In conclusion, this study lays the foundation and opens the doors for more data-driven and context-related evaluation of AI impact in software development and invites future work to refine, extend, and validate the proposed framework in different industries.

Bibliography

[1] Alain Abran and Pierre N. Robillard. Function points analysis: an empirical study of its measurement processes. IEEE Transactions on Software Engineering, 22(12):895–910, 1996.
[2] Aybüke Aurum, S Biffl, B Boehm, H Erdogmus, and P Grünbacher. Value-based software engineering. Springer, 2005.
[3] Marco Barenkamp, Jonas Rebstadt, and Oliver Thomas. Applications of AI in classical software engineering. AI Perspectives, 2(1):1, 2020.
[4] Jaiprakash Bhamu, JV Shailendra Kumar, and Kuldip Singh Sangwan. Productivity and quality improvement through value stream mapping: a case study of Indian automotive industry. International Journal of Productivity and Quality Management, 10(3):288–306, 2012.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[6] Christophe Commeyne, Alain Abran, and Rachida Djouab. Effort estimation with story points and COSMIC function points: an industry case study. Software Measurement News, 21(1):25–36, 2016.
[7] COSMIC - Common Software Measurement International Consortium. COSMIC sizing methodology website. Accessed April 10, 2025.
[8] John W Creswell and J David Creswell. Research design: Qualitative, quantitative, and mixed methods approaches. Sage Publications, 2017.
[9] RD Emrick. In search of a better metric for measuring productivity of application development. In Proceedings of Function Point Users Group Conference, 1987.
[10] Robert Feldt, Francisco G de Oliveira Neto, and Richard Torkar. Ways of applying artificial intelligence in software engineering. In Proceedings of the 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, pages 35–41, 2018.
[11] Stefan Feuerriegel, Jochen Hartmann, Christian Janiesch, and Patrick Zschech. Generative AI. Business & Information Systems Engineering, 66(1):111–126, 2024.
[12] Fiona Fui-Hoon Nah, Ruilin Zheng, Jingyuan Cai, Keng Siau, and Langtao Chen. Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration, 2023.
[13] Brian P Gallagher. Interpreting capability maturity model integration (CMMI) for operational organizations. 2002.
[14] Jean Hartley. What is a case study. Essential guide to qualitative methods in organizational research, 323, 2004.
[15] Alexej Kisselev. Alexej kisselev, jan 2023. Alexej Kisselev - Homepage.
[16] Jostein Langstrand. An introduction to value stream mapping and analysis. 2016.
[17] Jack E Matson, Bruce E Barrett, and Joseph M Mellichamp. Software development cost estimation using function points. IEEE Transactions on Software Engineering, 20(4):275–287, 1994.
[18] Jorge Melegati and Eduardo Guerra. DAnTE: a taxonomy for the automation degree of software engineering tasks. In Generative AI for Effective Software Development, pages 53–70. Springer, 2024.
[19] Dhasarathy Parthasarathy. Journeys in vector space: Using deep neural network representations to aid automotive software engineering. Doctoral thesis, Chalmers University of Technology and University of Gothenburg, 2023.
[20] Dhasarathy Parthasarathy, Yinan Yu, and Earl T Barr. Polymer: Development workflows as software. arXiv preprint arXiv:2503.17679, 2025.
[21] Mark C Paulk, Bill Curtis, Mary Beth Chrissis, and Charles V Weber. Capability maturity model, version 1.1. IEEE Software, 10(4):18–27, 1993.
[22] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590, 2023.
[23] Ameya Shastri Pothukuchi, Lakshmi Vasuda Kota, and Vinay Mallikarjunaradhya. Impact of generative AI on the software development lifecycle (SDLC). International Journal of Creative Research Thoughts, 11(8), 2023.
[24] Taman Powell and Tanya Sammut-Bonnici. Pareto analysis. 2014.
[25] Mike Rother and John Shook. Learning to see: value stream mapping to add value and eliminate muda. Lean Enterprise Institute, 2003.
[26] Jaakko Sauvola, Sasu Tarkoma, Mika Klemettinen, Jukka Riekki, and David Doermann. Future of software development with generative AI. Automated Software Engineering, 31(1):26, 2024.
[27] Alex Singla, Alexander Sukharevsky, Lareina Yee, Michael Chui, and Bryce Hall. The state of AI in early 2024: Gen AI adoption spikes and starts to generate value. Technical report, McKinsey & Company, 2024. Accessed March 10, 2025.
[28] Sandra A Slaughter, Donald E Harter, and Mayuram S Krishnan. Evaluating the cost of software quality. Communications of the ACM, 41(8):67–73, 1998.
[29] Rini Van Solingen. Measuring the ROI of software process improvement. IEEE Software, 21(3):32–38, 2004.
[30] Shuai Wang, Yinan Yu, Robert Feldt, and Dhasarathy Parthasarathy. Automating a complete software test process using LLMs: An automotive case study. arXiv preprint arXiv:2502.04008, 2025.
[31] Run Xu. The relationship between quality and efficiency in business management. Macro Management & Public Policies, 2(3), 2020.

A Appendix

A.1 Non-utilized Methods
Table A.1: Full non-utilized methods classification
Columns: Name of Tech, Context / Purpose, Description, Strengths, Weaknesses

FPA (Function Point Analysis)
Context / Purpose: Estimate project size for planning efforts, cost, and time pre-development.
Description: Measures software size by evaluating the functional requirements from a user perspective regardless of technology or language.
Strengths:
- Can be used early in project lifecycle
- Well-established method
- Structured effort estimation
Weaknesses:
- Complex and time-consuming
- Subjective weighting
- Requires training

COSMIC Function Points
Context / Purpose: Estimate project size for planning efforts, cost, and time pre-development.
Description: Focuses on software’s data movements (Entry, Exit, Read, Write) to calculate size, ideal for embedded/real-time systems like automotive platforms.
Strengths:
- Better suited for modern AI and embedded systems
- Precise measurement of functional user requirements
Weaknesses:
- Less common than FPA method
- Functional user requirements needed

ROSQ (Return on Software Quality)
Context / Purpose: Calculate return on investment and quantifying quality.
Description: Quantifying the business value gained from investing in software quality improvements.
Strengths:
- Helps justify investment of quality
- Enables ROI-based decision-making
Weaknesses:
- Difficult to isolate quality impact and other factors
- Requires internal company information
- Requires mature systems

CMMI / CMM (Capability Maturity Model Integration / Model)
Context / Purpose: Evaluate software Process Improvement evaluation.
Description: Frameworks for assessing and improving organizational process maturity, especially in software and systems engineering.
Strengths:
- Systematic SPI evaluation
- Recognized globally
Weaknesses:
- Resource-intensive
- Bureaucratic and time-consuming
- High initial adoption barrier

VSM (Value Stream Mapping)
Context / Purpose: Identify waste, bottlenecks, and improvement opportunities. Applied to assess AI integration’s value vs. effort, cost, and investment.
Description: A lean tool that visually maps the steps in a process to identify inefficiencies and waste-creating activities. Used to understand and optimize end-to-end workflows.
Strengths:
- Effectively highlights inefficiencies and delays
- Visualizes value flow and handoffs
- Supports continuous improvement initiatives
Weaknesses:
- Best suited for linear or sequential processes
- Less effective for complex, parallel workflows like those in CS/API Testing
- Time-intensive to create and maintain

A.2 Utilized Methods

Table A.2: Full utilized methods classification
Columns: Name of Tech, Context / Purpose, Description, Strengths, Weaknesses

Pareto Analysis
Context / Purpose: Identify and prioritize key quality and efficiency issues in software processes. Determines where AI should focus its efforts and whether it delivers impact.
Description: Uses the 80/20 rule to highlight the small number of factors causing the majority of problems. Ranks stakeholder-reported issues.
Strengths:
- Simple, visual, highlights high-impact areas
- Helps focus efforts
- Effective in early-stage analysis.
Weaknesses:
- May oversimplify
- Doesn’t suggest solutions
- Quality depends on input
- Can miss rare but critical issues

Software Process Improvement Initiative (SPII)
Context / Purpose: Evaluate whether AI components result in measurable improvements in time, cost, or quality. Links technical changes to business outcomes.
Description: Structured approach for assessing and guiding improvements to software processes based on changes introduced (e.g., AI).
Strengths:
- Business-aligned
- Flexible
- Supports continuous improvement
- Ties actions to measurable outcomes
Weaknesses:
- Requires baseline data
- Hard to isolate effects
- Can become complex
- Depends on consistent follow-up

Automation Taxonomy and Risk
Context / Purpose: Classify AI automation levels and associated risk to determine suitability in a given software process.
Description: Organizes tasks from manual to autonomous across application points (process/product/runtime), mapping them to risk levels.
Strengths:
- Structured risk awareness
- Quantifies automation levels
- Fosters cross-functional discussion
Weaknesses:
- Generic
- Lacks detailed risk types
- Needs interpretation
- Limited predictive depth

Automation Decision Tree
Context / Purpose: Guide decision-making between AI, scripting, or human effort for specific automation steps.
Description: Developer-informed logic tree that outlines criteria (e.g. structure, repeatability) for selecting an automation method.
Strengths:
- Simple, practical and consistent
- Captures expert knowledge
- Team-adaptable
Weaknesses:
- May oversimplify
- Doesn’t handle all edge cases
- Ignores broader constraints
- Not always predictive

A.3 Round 2 interview/survey questions

• Briefly describe and list the tasks and/or main sources of high efficiency in your/your team’s workflow?
• Briefly describe and list the tasks and/or main sources of low efficiency in your/your team’s workflow?
• Briefly describe and list the tasks and/or main sources of high quality in your/your team’s workflow?
• Briefly describe and list the tasks and/or main sources of low quality in your/your team’s workflow?
• What is your role and what is your experience working with component-level testing in years?