Investigating Text-only and Multimodal Retrieval Augmented Generation frameworks for Visual Question Answering

A study on the impact of modality and parameter optimization

Master's thesis in Applied Data Science

Marta Bortkiewicz & Cecilia Rundberg

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2024

© Marta Bortkiewicz & Cecilia Rundberg, 2024.

Supervisor: Ashkan Panahi, Department of Computer Science and Engineering
Advisor: Caroline Bükk, Wiretronic AB
Advisor: Isak Ernstig, Wiretronic AB
Examiner: Simon Olsson, Department of Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

Verifying the correctness of product assembly processes in a manufacturing setting is a crucial part of ensuring the quality of products.
Automating this procedure can help improve both the security and the efficiency of the routines. Utilizing a machine learning model to automate such procedures would require fine-tuning and substantial computational resources for the algorithm to adapt to the specific domain. An alternative approach, shown to enhance performance to the same extent as fine-tuning while not requiring additional computational resources, is the Retrieval-Augmented Generation (RAG) framework. In this project, two RAG frameworks are developed: a Text-only and a Multimodal RAG framework. The main goal of the frameworks is to accurately answer user queries about products where the ground-truth answer is located in a PDF user manual. The frameworks are developed by integrating a retrieval component with a generative component, either LLaMA2-7B or LLaVA-7B. The retrieval component retrieves context relevant to the user query from the manuals, on which the generative component bases its response. In addition to exploring how performance is affected by the modality of each framework, parameter tuning is explored. Evaluating how different values of the chunk size and top-k parameters affect performance allows the RAG frameworks to be optimized. The evaluation uses BERTScore metrics and LangSmith metrics, which provide complementary human-like judgment. The most crucial metrics are B-recall and contextual accuracy, both of which evaluate how well the generated response captures the information embedded in the ground-truth answer. The results show that the Text-only RAG framework is more stable across changes in the parameters than the Multimodal RAG framework, generating more coherent and rational responses. However, finding optimal parameters for the Multimodal RAG framework could lead it to outperform the Text-only RAG framework.
Overall, moderate chunk sizes (128 and 256) and top-k values of 4 or 6 led to the best performance for both RAG frameworks.

Keywords: Data science, Retrieval-Augmented Generation, RAG, Multimodality, project, thesis.

Acknowledgements

We would like to say a big thank you to Wiretronic AB for providing us with the opportunity to collaborate on this Master's thesis project. We are particularly grateful to Caroline and Isak for supporting us and brainstorming with us during challenging times. We would also like to express our appreciation to our academic supervisor Ashkan, who has given us valuable feedback during the entire process.

Marta Bortkiewicz & Cecilia Rundberg, Gothenburg, 2024-06-23

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Aim of the project
  1.2 Research questions
  1.3 Motivations
  1.4 Challenges
2 Background & Related work
  2.1 Background
    2.1.1 The RAG framework
      2.1.1.1 Text-only RAG
      2.1.1.2 Multimodal RAG
    2.1.2 RAG vs Fine-tuning
    2.1.3 Optimizing parameters
      2.1.3.1 Effects of chunk size
      2.1.3.2 Effects of top-k values
    2.1.4 Generative Large Language and Vision models
      2.1.4.1 LLaMA
      2.1.4.2 LLaVA
    2.1.5 Hallucination tendencies
    2.1.6 Evaluation
      2.1.6.1 BERTScore
      2.1.6.2 LangSmith
  2.2 Related work
    2.2.1 Manufacturing setting
    2.2.2 RAG
3 Methodology
  3.1 Data
  3.2 Data extraction
  3.3 Chunking
  3.4 Text-only RAG
    3.4.1 Generating summaries
    3.4.2 Embedding and Retrieval
    3.4.3 Integrating LLaMA2-7B
  3.5 Multimodal RAG
    3.5.1 Embedding and Retrieval
    3.5.2 Integrating LLaVA-7B
  3.6 Evaluation
4 Experiments
  4.1 Quantitative analysis
  4.2 Qualitative analysis
5 Results
  5.1 Quantitative analysis
    5.1.1 Evaluating chunk sizes
      5.1.1.1 BERTScore for Text-only RAG
      5.1.1.2 BERTScore for Multimodal RAG
      5.1.1.3 LangSmith for Text-only RAG
      5.1.1.4 LangSmith for Multimodal RAG
    5.1.2 Evaluating top-k values
      5.1.2.1 BERTScore for Text-only RAG
      5.1.2.2 BERTScore for Multimodal RAG
      5.1.2.3 LangSmith for Text-only RAG
      5.1.2.4 LangSmith for Multimodal RAG
  5.2 Qualitative analysis
    5.2.1 Manual 1
    5.2.2 Manual 2
    5.2.3 Manual 3
6 Discussion
  6.1 Developing domain-specific RAG frameworks
  6.2 Parameter and modality impact on RAG framework performance
    6.2.1 Key evaluation metrics for the domain-specific RAG framework
    6.2.2 Effects of modality
    6.2.3 Effects of parameters
      6.2.3.1 Chunk sizes
      6.2.3.2 Top-k values
      6.2.3.3 General observations
  6.3 Qualitative analysis overview
  6.4 Future work
  6.5 Risk analysis and ethical considerations
7 Conclusion
Bibliography
A Appendix 1

List of Figures

1.1 General structure of RAG integrated with an LMM.
2.1 Text-only RAG structure.
2.2 Multimodal RAG structure.
3.1 Simplified chunking process schema.
5.1 Plot of BERTScore for Text-only RAG for different chunk sizes.
5.2 Plot of BERTScore for Multimodal RAG for different chunk sizes.
5.3 Plot of LangSmith scores for Text-only RAG for different chunk sizes.
5.4 Plot of LangSmith scores for Multimodal RAG for different chunk sizes.
5.5 Plot of BERTScores for Text-only RAG for different top-k values.
5.6 Plot of BERTScores for Multimodal RAG for different top-k values.
5.7 Plot of LangSmith scores for Text-only RAG for different top-k values.
5.8 Plot of LangSmith scores for Multimodal RAG for different top-k values.
A.1 Generated responses by LLaVA-7B, the Text-only RAG and the Multimodal RAG framework for questions about the Dell manual.
A.2 Generated responses by LLaVA-7B, the Text-only RAG and the Multimodal RAG framework for questions about the Samsung manual.
A.3 Generated responses by LLaVA-7B, the Text-only RAG and the Multimodal RAG framework for questions about the Sony manual.
A.4 An example page from one of the manuals chosen for evaluation.

List of Tables

5.1 Table of BERTScore for Text-only RAG for different chunk sizes.
5.2 Table of BERTScore for Multimodal RAG for different chunk sizes.
5.3 Table of LangSmith scores for Text-only RAG for different chunk sizes.
5.4 Table of LangSmith scores for Multimodal RAG for different chunk sizes.
5.5 Table of BERTScores for Text-only RAG for different top-k values.
5.6 Table of BERTScores for Multimodal RAG for different top-k values.
5.7 Table of LangSmith scores for Text-only RAG for different top-k values.
5.8 Table of LangSmith scores for Multimodal RAG for different top-k values.
6.1 Comparison of the scores achieved by the best-performing configurations of the Text-only and Multimodal RAG frameworks.

List of Abbreviations

QA Question Answering
RAG Retrieval-Augmented Generation
LLM Large Language Model
LMM Large Multimodal Model
AI Artificial Intelligence
LLaVA Large Language-and-Vision Assistant
LLaMA Large Language Model Meta AI
NLP Natural Language Processing
CLIP Contrastive Language-Image Pre-Training
BERT Bidirectional Encoder Representations from Transformers
OCR Optical Character Recognition
VQA Visual Question Answering

1 Introduction

In manufacturing, the successful execution of product assembly processes relies on standardized instructions that guide line workers through the procedures. Verification of assembly processes is usually purely manual, which is considered time-consuming and prone to errors, calling for more efficient verification methods. Despite the critical role of assembly in industrial settings, utilizing machine learning algorithms to answer questions related to assembly verification remains under-explored compared to open-domain context-based Question Answering (QA) [1]. One reason for this is the lack of standard benchmark datasets for assembly verification QA. Despite limited progress, interest in automating human performance in everyday tasks such as verifying assembly processes is rapidly growing. As such, the ability to derive meaningful insights from multimodal data, i.e. a combination of visual and textual information, is a crucial skill for machines to achieve this [2], [3].

Recognizing the need for automated assembly verification methods, an emerging method called Retrieval-Augmented Generation (RAG) can address the challenge. It combines information retrieval with a generative component, either a general-purpose Large Language Model (LLM) or a Large Multimodal Model (LMM).
The retriever component extracts information from an external data source, while the generative model produces a text response based on the retrieved context. A simplified structure of a RAG framework is presented in Figure 1.1.

In this project, the aim is to build and explore RAG frameworks capable of analyzing content of different modalities within electronics assembly manuals. Further, the aim is to integrate such a framework with a generative model to address worker queries effectively. Our project is part of long-term research at Wiretronic AB on designing a Visual Question Answering (VQA) technology with which workers can interact with a machine to receive guided instructions, instead of manually referring to written instructions. The company has previously researched technologies for extracting visual information using segmentation networks. Building upon this foundation, our investigation aims to extend the modality of the data by incorporating text.

Figure 1.1: General structure of RAG integrated with an LMM.

In particular, we intend to explore whether context from manuals, retrieved by a RAG framework and used as an input prompt to an open-source LLM or LMM, can generate accurate answers to workers' queries. By leveraging LLMs and LMMs, the aim is to explore new possibilities for the interaction of line workers with machines, so that future assembly processes can be verified and guided with high precision and minimal human supervision.

Driven by the growing interest in multimodality, which combines image and text data, the VQA field recently emerged within the Artificial Intelligence (AI) community. Given a natural language question about an image, its goal is to generate an accurate natural language answer [4]. Currently, among the open-source models developed in this field, the Large Language-and-Vision Assistant (LLaVA) [5] stands out as one of the most successful models [6].
Existing VQA tasks are designed to answer questions about web pages or infographics [7], which is not suitable for user manual VQA. They mainly consider single-page documents, while product user manuals are composed of multiple pages that should be processed together. This research gap needs to be filled [8]. In this project, information retrieval plays a pivotal role, as the framework aims to reason over multi-page documents. The framework aims to understand the text, layout, and visual elements of each page to extract regions relevant to a query, which then serve as a prompt to an LLM or an LMM. The application of prompting with extracted information remains under-explored for LLMs and LMMs [9], which underscores the importance of exploring RAG frameworks to enhance the accuracy of these models in processing different data modalities in this domain-specific application.

1.1 Aim of the project

This project aims to build and explore two RAG frameworks with capabilities of interpreting different modalities, integrating them with either Large Language Model Meta AI (LLaMA) [10] or LLaVA. The first RAG framework, called the Text-only RAG framework, can interpret and embed text data. It has an additional generative component that transforms all image data into text descriptions. The other RAG framework, called the Multimodal RAG framework, can interpret and embed both text and image data. Hence, the two RAG frameworks are built according to different approaches to processing text and image data from user assembly manuals.

Further, this project aims to evaluate how accurately the two frameworks can generate answers based on the information retrieved from the assembly manuals. This goal requires finding the optimal strategy for generating answers to a worker's query about the assembly process of wire harnesses. It is achieved by a strategic optimization of the frameworks through testing of different parameters.
Two parameters are explored: the chunk size of the retrieved text data and the top-k number of retrieved documents. Evaluating different values of these parameters allows the performance of the RAG frameworks to be optimized.

The project outline consists of the following steps:

• Build and explore different modalities of retrieval models that extract texts and images relevant to the query from PDF files to prompt LLaMA2-7B or LLaVA-7B.
• Optimize the RAG framework to find the most efficient strategy to generate answers to users' queries about assembly manuals.
• Evaluate and compare different configurations of the frameworks.

1.2 Research questions

Given the aim of this project, the following research questions are formulated:

1. Is it possible to integrate a pre-trained LLM or LMM with a retrieval component in a RAG framework to generate responses to domain-specific questions?
2. How do the modality and the parameters of a RAG framework affect the performance of generated responses?

Building upon the foundation of prior research in the field of VQA, this project aims to extend the exploration of the effectiveness of VQA models in the specific context of electronics assembly processes. We use insights drawn from the aforementioned studies to incorporate textual instructions, visual information, and user queries, to develop two retrieval mechanisms and integrate them with an LLM or an LMM into RAG frameworks.

1.3 Motivations

While much research exists on the application of VQA models in domains such as biomedical imaging, the adaptation of these models to the field of manufacturing, specifically assembly processes, remains under-explored. Developing a RAG framework for this specific domain and integrating it with an LLM or LMM will open opportunities to apply the same kind of structure to various other domains.
Working towards automation is crucial since it will minimize human error, increase safety, and support the decision-making process. Automation reduces manufacturing costs and shortens the production cycle.

Furthermore, RAG frameworks have been shown to produce the same level of enhanced accuracy of VQA answers as fine-tuning an LLM on the specific domain. Since fine-tuning comes with high computational costs, utilizing a RAG framework is a more efficient and flexible solution for domain-specific tasks. By using domain-specific data, the framework adapts to the domain and extracts the relevant information from the user manuals with fewer required computational resources [11].

The insights gained from this project can extend beyond the electronics manufacturing domain, influencing the broader landscape of Natural Language Processing (NLP) and Computer Vision by proposing more interactive and adaptable systems. From Wiretronic's point of view, the motivation is to gain more knowledge about how multimodal models can be used as a tool in manufacturing processes. The long-term goal, which is beyond the scope of this project, is to implement an automated model that can be used as an aid for manufacturing workers to verify the assembly process. This project is the starting point of this long-term goal, with the expectation of gaining insights that can be elaborated on further in the future.

1.4 Challenges

Several challenges were encountered during the development of the RAG frameworks. Due to limited access to manuals that naturally consisted of both text and image data, extracting these data required additional effort. Every page of each manual was in image format, meaning all texts and images were stored as images. Instead of extracting text paragraphs from the manuals directly as text data, they first had to be converted to a natural text format.
This resulted in extra steps in the implementation of the extraction stage. Another challenge was the running time of the entire procedure, from extracting the text and image data from the manuals to the final evaluation. Due to the long running times, not all manuals could be included. Ideally, all 209 manuals present in the dataset would be used to obtain the most accurate evaluation scores. However, the time limit of the project only allowed a subset of 10 manuals to be used for the evaluation.

2 Background & Related work

2.1 Background

Due to the amount of computational resources required to fine-tune an LLM or an LMM, it is difficult to adapt and utilize one for specific domains. RAG frameworks offer a solution to this issue, making it possible to retrieve, use, and incorporate data from external documents when utilizing a generative model for domain-specific tasks. The RAG framework does not rely on the QA model being fine-tuned on the domain, nor does it require the model to be trained on up-to-date data [12].

2.1.1 The RAG framework

The very first stage, before any execution of the RAG framework can take place, is properly preprocessing the input documents. The documents need to be split into representations of the different data types, usually images, texts, and tables. The extraction stage consists of specifying which data can later be retrieved by the model and how it should be retrieved. An important factor to consider when deciding how the data should be extracted is the size of the chunks that the text is split into. The size can affect the extent to which the retriever is capable of capturing context. To avoid information loss, the chunks should be of sufficient size. A second factor to consider is how tables should be represented in the extracted data, and whether they should be part of the text chunks or represented separately from the text data.
This should be decided based on the domain and the context, and on whether table data is crucial for the specific purpose. Another aspect that needs careful consideration is how images, and specifically text in image format, should be represented. For some tasks, keeping text in image format is the better choice, for example for an image of a chat. In other cases, the individual text messages in the image of the chat are necessary for capturing the context and need to be extracted as text data [13].

After the data has been extracted from the documents as separate texts, alongside table and image representations, the embedding takes place. The data is encoded and mapped into vector representations in a single embedding space. The user's query inputted to the framework is embedded with the same method, and a similarity search between the query and the embedded data is performed. The chunks that receive the highest similarity scores are then retrieved. Here, an important factor that needs to be decided is the number of documents to be retrieved, called the top-k value. It can affect the generation process by providing either adequate, insufficient, or noisy context to the prompt. The retrieved parts are augmented with the user query into the prompt. The augmentation is performed to enhance the context of the input that is later prompted to the LLM or LMM. Optimizing the input improves the model's capability to generate an accurate final response [13].

There are several approaches to handling multimodal data in documents that will be retrieved by the RAG framework. Depending on the approach, the structure of the RAG framework will be either text-only or multimodal.

2.1.1.1 Text-only RAG

Figure 2.1: Text-only RAG structure.

One approach is based on narrowing down the multimodal data into representations of a text-only format.
First, the extraction is performed, where text, tables, and images are saved separately. Before the embedding is performed, the images and tables are passed into an LMM. The LMM, prompted with an image and a question regarding the image, generates a response to the question in text format only. Extracted tables and images are prompted into the LMM together with an instruction telling the model to make explicit summaries of the content. After summarizing the content of images and tables, all of the information from the original document is represented in text format and the embedding is performed. The textual data is embedded into vectors. Because the data has been transformed into a single modality, a text embedding model can be used to create text feature vectors for the whole document, and an LLM can be utilized to produce the final answer. The data flow structure in the Text-only RAG framework is illustrated in Figure 2.1.

2.1.1.2 Multimodal RAG

Another approach to treating multimodal data in documents that will later be retrieved by the RAG framework is to extract and keep the raw images and tables. In this case, a multimodal embedding model is used, which makes it possible to embed text, image, and table data into the same vector space. Because the extracted data has different modalities, an LMM has to be utilized to produce the final answer. This process is illustrated in Figure 2.2.

Figure 2.2: Multimodal RAG structure.

The two approaches have certain advantages and drawbacks. The Text-only RAG framework, which uses a text embedding model, risks losing context that can be crucial for generating an accurate and precise response. This drawback can appear because it retrieves the summary of the information incorporated in the images and not the raw images themselves.
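As an illustration of the text-only preparation, the sketch below reduces a mixed set of extracted elements to a text-only corpus ready for a text embedding model. The element structure and the `summarize_with_lmm` stub are our own simplifications; in the actual framework the stub would be replaced by a call to an LMM such as LLaVA with a summarization prompt:

```python
def summarize_with_lmm(element: dict) -> str:
    """Stand-in for prompting an LMM to summarize an image or table.

    A real implementation would send the raw image/table to the model
    together with an instruction such as 'Summarize the content of this
    image explicitly' and return the generated text.
    """
    return f"[{element['type']} summary of {element['source']}]"

def to_text_corpus(elements: list[dict]) -> list[str]:
    """Reduce mixed extracted elements (text/table/image) to text only,
    so a single text embedding model can embed the whole document."""
    corpus = []
    for el in elements:
        if el["type"] == "text":
            corpus.append(el["content"])           # text passes through unchanged
        else:
            corpus.append(summarize_with_lmm(el))  # images/tables become summaries
    return corpus
```

After this step every element is plain text, which is exactly the condition the Text-only RAG framework relies on before embedding.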
The Multimodal RAG framework, which uses a multimodal embedding model instead, allows all of the raw content of the document to be captured and minimizes the risk of losing crucial context. However, the multimodal approach comes with challenges, as the complexity grows with the number of modalities. Extracting and embedding different types of modalities accurately is more complicated [14].

2.1.2 RAG vs Fine-tuning

Fine-tuning an LLM or LMM comes with many advantages and can make the model adapt to a new domain and deliver state-of-the-art performance in generating answers to domain-specific questions. In addition to the advantage of very accurate responses, the required size of the input and output tokens remains the same and does not demand more computational resources. LLMs are trained on huge amounts of data and typically contain billions of parameters. Fine-tuning extends the training of the model by adding more data from the desired domain, allowing the model to update its parameters accordingly. This process involves traversing all of the LLM's parameters. Due to the huge size and parameter count of many recently developed LLMs and LMMs, this process can become increasingly computationally expensive and time-consuming. Depending on the task and the goal of adapting an LLM or LMM to a specific domain, the advantages and disadvantages should be carefully weighed [11], [12].

Due to the limitations of fine-tuning a QA model, the development of RAG frameworks has attracted growing interest. The main advantage of using RAG is the small amount of computational resources needed in comparison to fine-tuning. As the LLM or LMM is treated as a black box when using RAG, its parameters are not modified or updated. This leads to a drastically decreased initial cost: instead of fine-tuning, the corresponding process is creating the embeddings, which makes the process more flexible [12].
The flexibility of using a RAG framework also lies in being able to decide, change, and add the data that should be included without making the process more complex. Further, both methods have been shown to produce similar improvements in overall performance. RAG frameworks are also known to be effective in tasks where the data is contextually relevant [11].

2.1.3 Optimizing parameters

One of the most important parts of the RAG framework is the construction of the retrieval vector store and the retriever itself [15]. In the RAG framework, the components of the retriever significantly impact the overall performance by defining how effectively the framework retrieves and utilizes relevant information from the vector store. The key parameters of the retriever are the chunk size and the top-k value. Although chunking itself is part of the data extraction phase, while top-k chunk retrieval happens at the end of the retrieval phase, they both directly influence the behavior and the performance of the retriever component. The impact of these two parameters is described in this section.

2.1.3.1 Effects of chunk size

RAG frameworks are sensitive to the chunking method chosen to split data into smaller units, which are stored in a vector store. Chunking means breaking down large input documents into smaller segments of a fixed length called the chunk size. Each chunk should contain specific information essential for addressing user queries. A good chunking strategy is crucial to ensure high relevance and accuracy of the retrieved context [15], [16]. Hence, chunking aims to retrieve the context with minimal noise while maintaining semantic relevance [17]. The size of the chunks determines the breadth of the context retrieved by the retriever, which makes it a critical parameter in the retrieval phase.
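As a concrete illustration, fixed-length chunking can be sketched as below. The function is a simplification we introduce here: tokens are approximated by whitespace splitting (real pipelines count tokenizer tokens), and consecutive chunks overlap slightly so that sentences cut at a chunk boundary survive intact in one of the two chunks:

```python
def chunk_text(text: str, chunk_size: int = 128, overlap: int = 16) -> list[str]:
    """Split text into fixed-size chunks of at most `chunk_size` tokens.

    Tokens are approximated by whitespace splitting; consecutive chunks
    share `overlap` tokens so content cut at a boundary still appears
    complete in one of the two neighboring chunks.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Calling `chunk_text(manual_text, 128)` or `chunk_text(manual_text, 256)` would then correspond to the moderate chunk-size configurations evaluated later in this thesis.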
When the chunk size is not adapted to the task, there is a risk that too little or too much information will be included in the context for a given query. While smaller chunk sizes might speed up the retrieval phase, they can produce incomplete context that does not provide enough of the necessary information for the generation process. For example, chunk sizes like 128 or 256 tokens can capture finer semantic details but may miss some critical information. On the other hand, chunk sizes that are too large, such as those containing 512 or 1024 tokens, can preserve more extensive context, but at the risk of containing too much irrelevant or imprecise information that can slow down and confuse the generation process [17].

The ideal chunk size should maintain high accuracy of the generated answer while preserving all the necessary information in the context. To find it, it is recommended to run empirical experiments with various chunk sizes. In addition, the optimal chunk size should be chosen based on the nature of the documents stored in the vector store, the length of the user query, and the specific requirements of the task [16]. In this project, the documents used are electronics manuals, which are long but contain a lot of concise information. The challenge with this data type is finding a balance between granularity and comprehensiveness. Ideally, the optimal chunk size would capture the essence of each step described in the manual while also maintaining an overview of how it contributes to the larger assembly process.

2.1.3.2 Effects of top-k values

The top-k value is another critical parameter of the retrieval phase in the RAG framework. It is the number of text chunks retrieved for each query, which is why it determines the capacity and quality of the retrieved context. The retriever finds the top-k vectors in the vector store that are most similar to the query vector and uses these to retrieve the respective text chunks.
These text chunks are then used to prompt the LLM or LMM; hence, the amount of retrieved context matters and may affect the final generated result [15]. The amount of information that the model receives in a prompt depends on the size of k. The ultimate goal is to retrieve information that is both comprehensive and relevant. When the top-k value is too small, it can lead to the problem of information scarcity: essential data from the vector store will not be included in the prompt, causing the generation of incomplete or inaccurate answers. On the other hand, when the top-k value is too high, it becomes harder to recognize the relevant chunks, which can lead to less accurate or incoherent responses [15]. An overly high top-k value also creates a risk of retrieving irrelevant chunks, which may introduce noise and lower the quality of generated responses. In addition, it is usually more computationally expensive and time-consuming to process a larger number of chunks. Hence, finding the optimal top-k value is very important for building an efficient and accurate RAG framework. Just as for the chunk size, a balance needs to be found between providing all the needed context and avoiding information overload. This can be achieved through empirical testing. Finding the top-k documents can be further optimized by incorporating a re-ranker or dynamic top-k retrieval. Most vector stores use vector similarity search criteria to search through vectors. However, computing this similarity score between document chunks and the prompt does not always return relevant contexts [18], [19]. In that case, a re-ranker model is a beneficial addition, as it re-evaluates the top-k chunks based on criteria other than vector similarity, for example keyword search [18], or models such as cross-encoders. Integrating a re-ranker into RAG frameworks makes the retrieved context smaller and more relevant to the query.
On the other hand, a RAG system enhanced with a re-ranker uses more computational resources than a basic vector-similarity-based RAG system [19]. Moreover, in several cases, depending on the complexity of the question, a different number of top-k chunks should be retrieved. In this scenario, dynamic top-k retrieval is used in contrast to static top-k retrieval. It adapts the number of retrieved chunks to the complexity of each query. This can be done by training a cross-encoder to predict the most suitable top-k value for each retrieval task. Dynamic top-k retrieval ensures high relevance and an optimal amount of retrieved information, simultaneously reducing computational costs by omitting the processing of unnecessary information [20]. However, such an approach is only suitable for tasks where the questions have significantly different levels of complexity.

2.1.4 Generative Large Language and Vision models

The integration of an LLM or an LMM within the RAG framework serves as the final step of the framework, delivering the textual output. These models rely on the transformer architecture and self-supervised learning to generate human-like text. They are pre-trained on extensive text corpora and have a deep understanding of natural language, text coherence, and contextual relevance [21]. However, they encounter challenges when the situation requires an understanding of specific information from an external data source. When handling domain-specific or highly specialized queries [22], they commonly generate incorrect information, referred to as hallucinations [23]. These limitations emphasize that LLMs or LMMs should not be implemented as solutions in real-world manufacturing environments without additional safeguards [13]. They also cannot learn and retain new information without undergoing a retraining process, which is computationally expensive and time-intensive.
Therefore, integrating them into the RAG framework to produce accurate and relevant responses is valuable. This integration combines the comprehensive internal knowledge of language models with external data retrieval. It can also enhance the models' ability to provide accurate and precise responses. LLaVA and LLaMA are among the question-answering models that can be utilized as generative models in the RAG framework, depending on the modality of the information retrieved by the retriever. LLaMA is leveraged to produce summaries of text paragraphs, while LLaVA is used to provide text summaries of images present in the documents. When the output of the retriever is only in text format, LLaMA is used to produce the final answer to the user query. In cases where the output is multimodal, LLaVA is used instead. LLaVA and LLaMA can be run locally with Ollama, a local inference framework client. The local execution that this framework provides ensures data privacy, as the information is not shared externally.

2.1.4.1 LLaMA

LLaMA is a foundational large language model that works only with the text modality, taking a sequence of words as input and recursively generating text. It is based on a transformer architecture with optimizer implementations and causal multi-head attention to improve performance. LLaMA models are available in several sizes between 7B and 65B parameters, and they can reach similar or even better results on several benchmarks than ground-breaking larger models when enough data is used for training. Smaller LLaMA models, trained on more tokens, are also easier to adapt to specific use cases [10].

2.1.4.2 LLaVA

LLaVA is built on the foundation of LLaMA, combined with an image encoder and a text decoder, allowing it to integrate a visual and a textual embedding space [5]. The Contrastive Language-Image Pre-Training (CLIP) model [24] is used as the image encoder that
converts the image into the same vector representation space as text. It does so by connecting the visual features from input images to language embeddings through a trainable projection matrix. These visual tokens, which share the dimensionality of the word embedding space with the language tokens, are integrated with the user text prompt. Then, the LLM component of LLaVA generates the final text response. LLaVA has been shown to be able to adapt and produce state-of-the-art performance within various domains, such as the challenging Science QA benchmark [25].

2.1.5 Hallucination tendencies

As for many LLMs and LMMs, LLaMA and LLaVA tend to hallucinate when generating a response. A hallucination is a made-up answer to a question that typically comes across as being true. The response is written as factual information, which can make a hallucination hard to detect. Because of this, it is important to be cautious when interacting with an LLM or LMM about subjects that are outside the user's expertise [26]. Several factors have been shown to trigger hallucinations for LLaMA. In the study LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples by Yao et al., it is shown that the format of the prompt can have an impact on how often a hallucination is triggered. Two kinds of modifications of a typical prompt were tested to see how they would affect the outcome. The first modification kept the semantic context of the prompt but had a few tokens changed to random tokens. The second modification randomized the initial tokens of the prompt, leading to an unspecified semantic context. The two formats triggered hallucinations at rates of 54% and 31%, respectively. These results indicate that careful prompt engineering is an important factor in avoiding hallucinations [26].
For LLaVA, which uses the multimodal embedding model CLIP, there are other challenges to consider to avoid hallucinations. Since CLIP encodes both textual and visual data, there is a risk of an information gap arising between the two types of data. This gap can lead to an increased risk of triggering hallucinations. Hence, the part of the model that aligns the two data modalities needs to maintain a high quality to minimize the potential impact of this issue. Another aspect that has been shown to trigger hallucinations for LLaVA is the resolution of the images that are passed to and encoded by CLIP. A lower image resolution has been shown to be a factor that triggers hallucinations. This issue is likely caused by the lack of visual information in low-resolution images [27].

2.1.6 Evaluation

The evaluation of generative tasks in machine learning poses specific challenges, which differ from those known for traditional classification or regression tasks [28]. During the evaluation of a RAG framework, two key stages, the retrieval and the generation phase, should be assessed separately. When evaluating the retrieval quality, the relevance of the retrieved documents to the user query is calculated. The generator's assessment tests how coherent and relevant the answer produced from the retrieved context is. By assessing these stages separately, the quality of the retrieved context and the accuracy of the produced content are both examined. The issue with traditional quantitative metrics like BLEU or METEOR is that they often fall short in capturing the domain-specific effectiveness of RAG models [29]. These N-gram metrics do not account for word order or semantic variations. One metric that addresses these shortcomings is BERTScore, an evaluation metric based on Bidirectional Encoder Representations from Transformers (BERT) embeddings. It measures how similar the generated response is to the ground truth answer.
However, since there are various ways to sufficiently answer a query in written language, the performance measurement still relies on subjective judgment. Therefore, to supplement the BERTScore output with more 'human-like' judgment, LLMs or LMMs can be utilized to assess the generated answers. Researchers have named the approach of utilizing an LLM to evaluate the responses of an LLM-based RAG framework the "LLM-As-A-Judge" approach [30]. In the case of Multimodal RAG, a judging model capable of considering both textual and visual context is required; therefore, an LMM needs to be utilized. Retrieval quality can be evaluated most fundamentally by calculating page-level and paragraph-level accuracy [16]. This involves comparing the manually selected ground truth section from the pages of the document with the chunks returned by the retrieval algorithm. When the reference and the retrieved context are located on the same page or paragraph, the page-level or paragraph-level accuracy will be high. Since the dataset used in this project contains no manually selected ground truth regions of text, only ground truth answers, the LLM-As-A-Judge approach is employed instead.

2.1.6.1 BERTScore

To evaluate the semantic similarity between the generated response and the ground-truth answer, BERTScore is used. It employs pre-trained BERT contextual embeddings for both the generated and reference answers. BERT contextual embeddings, unlike regular ones, can produce different vector representations for a given word in different sentences, depending on the surrounding words that establish the context of the target word [31]. Each word's representation is calculated using a Transformer encoder, which iteratively employs self-attention and nonlinear transformations. Then, the pairwise cosine similarity between each token x_i in the reference sentence and each token x̂_j in the candidate sentence is calculated.
The cosine similarity of these two nonzero vectors is calculated as:

\[ \frac{x_i^\top \hat{x}_j}{\lVert x_i \rVert \, \lVert \hat{x}_j \rVert} \tag{2.1} \]

Since pre-normalized vectors are used, the similarity reduces to the dot product:

\[ x_i^\top \hat{x}_j \tag{2.2} \]

The complete BERTScore consists of precision, recall, and F1 metrics. Calculating recall involves matching each token in the reference x with a token in the candidate x̂, while calculating precision involves matching each token in x̂ with a token in x. Greedy matching is used to maximize the similarity score, and the F1 score is calculated by combining precision and recall [31]. The equations for recall, precision, and F1 are:

\[ R_{\mathrm{BERT}} = \frac{1}{\lvert x \rvert} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j \tag{2.3} \]

\[ P_{\mathrm{BERT}} = \frac{1}{\lvert \hat{x} \rvert} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^\top \hat{x}_j \tag{2.4} \]

\[ F_{\mathrm{BERT}} = 2 \, \frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}} \tag{2.5} \]

The final step of calculating BERTScore involves re-scaling the output values to make them more human-readable. Since the cosine similarity values in practice lie in a very limited part of the [-1, 1] interval, BERTScore is re-scaled linearly, as follows:

\[ \hat{R}_{\mathrm{BERT}} = \frac{R_{\mathrm{BERT}} - b}{1 - b} \tag{2.6} \]

After re-scaling, R̂_BERT typically falls between 0 and 1, and the same procedure is applied to P_BERT and F_BERT. The constant b is derived by averaging BERTScores calculated on randomly paired candidate-reference sentences from Common Crawl monolingual datasets. BERTScore allows the evaluation to be more precise than evaluation metrics that use N-gram methods, such as BLEU and METEOR. As discussed, N-gram-based metrics come with several drawbacks. For instance, the BLEU score [32] only assesses the N-gram overlap between the candidate and the reference. One drawback of such an approach is the inability to capture dependencies that may be located far apart in a text.
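The greedy-matching computation of recall, precision, and F1 can be reproduced numerically. The sketch below assumes pre-normalized token embeddings (here toy one-hot vectors rather than real BERT embeddings), so cosine similarity is a plain dot product; the function name `greedy_bert_score` is illustrative.

```python
import numpy as np

def greedy_bert_score(ref_emb, cand_emb):
    """Precision, recall, and F1 via greedy matching over pre-normalized embeddings.

    ref_emb:  (n_ref, d) array, rows are reference-token embeddings x_i
    cand_emb: (n_cand, d) array, rows are candidate-token embeddings x_hat_j
    """
    sim = ref_emb @ cand_emb.T               # pairwise dot products of normalized vectors
    recall = sim.max(axis=1).mean()          # best candidate match per reference token
    precision = sim.max(axis=0).mean()       # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: the candidate reproduces two of three reference tokens exactly.
ref = np.eye(3)            # three orthogonal unit "token embeddings"
cand = np.eye(3)[[0, 2]]   # candidate misses the middle token
p, r, f1 = greedy_bert_score(ref, cand)  # p = 1.0, r = 2/3, f1 = 0.8
```

Every candidate token matches a reference token perfectly (precision 1.0), but one reference token has no good match, which lowers recall, exactly the asymmetry the two metrics are designed to capture.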
In contrast, BERTScore utilizes the aforementioned pre-trained contextual embeddings, which capture the context of words and can recognize word order and distant dependencies in the text. N-gram methods also do not usually perform well on texts that are rewritten with synonyms or that summarize an original text. Unlike N-gram-based metrics, which assign high scores only to overlapping tokens and therefore score semantically correct sentences that deviate from the original wording poorly, BERTScore can detect paraphrases: computing the sum of the cosine similarities between token embeddings allows paraphrases to be recognized. By overcoming the shortcomings of N-gram-based metrics, BERTScore has been shown to be a more reliable evaluation metric. It has also been shown to correlate with human judgments, which is an important indicator when evaluating text generation tasks [31].

2.1.6.2 LangSmith

LangSmith is a platform provided by LangChain that allows users to track, evaluate, and monitor ongoing processes powered by LLMs and LMMs. It provides real-time monitoring of model runs and uses traces to log almost every aspect of each run. It is possible to view and get statistics on these results with the available logging and visualization components. Additionally, the LangSmith API offers several built-in metrics that follow the LLM-As-A-Judge method for in-depth evaluation. They are a valuable tool to support traditional evaluation methods when dealing with generated content that has complex language nuances and requires contextual understanding. One downside is that they return a binary score for each data point. Therefore, to accurately measure differences in prompt or model performance, it is most effective to aggregate results across a larger dataset [33].
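Because each judge verdict is binary, per-configuration quality is estimated by averaging over many evaluated examples. A minimal sketch, where the verdicts and configuration labels are hypothetical and not results from this thesis:

```python
from statistics import mean

# Hypothetical binary verdicts (1 = judged "correct") per framework configuration.
judge_verdicts = {
    "text-only, chunk=256, k=3":  [1, 1, 0, 1, 1],
    "multimodal, chunk=256, k=3": [1, 0, 0, 1, 0],
}

# Aggregated accuracy per configuration smooths out the per-example binary noise.
accuracy = {cfg: mean(v) for cfg, v in judge_verdicts.items()}
```

With only a handful of examples the estimate is coarse; the averages become meaningful only over a larger evaluation set, which is the point made above.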
In this project, the chosen metrics for evaluating the Text-only and Multimodal RAG frameworks are contextual accuracy with chain-of-thought (CoT) reasoning, coherence, and relevance. Contextual accuracy is a standard metric that measures the correctness of a generated response to a user query. Coherence and relevance are both part of the Labeled Criteria metrics from LangSmith, which prompt an LLM to provide the reasoning behind assigning a label for a given criterion. The metrics specifically operate as follows:

• Contextual accuracy works by instructing an LLM to grade a response as "correct" or "incorrect" based on the ground truth answer. It is enhanced by chain-of-thought reasoning, providing examples of a logical progression of thoughts before determining a final verdict. This approach helps to better align the responses with human judgment.

• Coherence tests how well the response is structured sequentially and logically. The criterion prompted to an LLM along with the generated and ground truth answer in this evaluator is: "Is the submission coherent, well-structured, and organized?" An answer labeled 1 should be organized, easy to read, and consist of text that addresses the topic at hand.

• Relevance measures how well the generated response matches the question. The criterion prompted to an LLM along with the generated and ground truth answer in this evaluator is: "Is the submission referring to a real quote from the text?" An answer labeled 1 should be genuinely relevant to the posed user query.

2.2 Related work

In this section, previous studies with findings relevant to the goal of this project are presented. The studies are divided into sections related to the manufacturing setting of the project and RAG frameworks.
2.2.1 Manufacturing setting

Previous research in the field has shown that it is possible to develop a VQA model that increases the quality and production efficiency of human-technology manufacturing processes. In the study Digital twin improved via VQA for vision-language interactive mode in human-machine collaboration by Wang et al. (2021), a VQA model is developed to give responses to different kinds of questions regarding the manufacturing process. The core of this model is based on Computer Vision and NLP, and it can generate a response to either open-ended questions or multiple-choice questions. A Convolutional Neural Network (CNN) is used as the first step in the model and contributes to understanding the visual input. The second step of the implementation is a Long Short-Term Memory (LSTM) network, which contributes to text processing and language understanding. Finally, a fusion between the visual and textual features takes place to prepare for the decoding of the generated answers. The authors of the paper concluded that the VQA model manages to answer both open-ended and multiple-choice questions, making it able to identify certain problems and challenges during the manufacturing process [34]. Zhang et al. [7] have recently developed a framework called Multimodal Product Manual Question Answering (MPMQA), which interacts with product manuals to retrieve a relevant part as an answer to a user's query. Unlike most existing models, which leverage only textual information [1], MPMQA requires the model to comprehend both the visual and the textual contents. Given a textual question and a multipage digital user manual, MPMQA provides a multimodal answer for the given question. To support this task, a large-scale, diverse dataset called PM209 with human annotations was created. It consists of 22,021 QA pairs from user manuals of electronic brands. MPMQA addresses two stages: page retrieval and multimodal QA.
The model employed for this task is the Unified Retrieval and Question Answering (URA) model, which consists of a URA Encoder, a URA Decoder, and a Region Selector. In the page retrieval stage, the model encodes questions and pages separately and calculates their relevance scores with token-level interaction. In the multimodal QA stage, the model encodes questions and pages jointly and produces the textual and visual parts of the multimodal answer through the Decoder and Region Selector. Finally, URA is optimized in a multitask learning manner. It achieves competitive results compared to multiple task-specific models and proves successful in both information retrieval and multimodal QA tasks.

2.2.2 RAG

From the very first development of the RAG framework, the core idea was to bridge the field of generative AI with retrieval-based systems to enhance overall performance. Since then, many prominent developments have been made in the field, contributing to technical improvements and a broader range of application areas [35]. As one of the recent developments in the field of RAG, Tang et al. present the benchmark dataset MultiHop-RAG, which contains queries and ground-truth answers that are located across multiple documents. The report presents how well some of the most prominent embedding models and generative models, such as GPT-4 and LLaMA2-70B, perform on these types of queries. The core idea is to measure how well the models can retrieve and generate responses based on information located across multiple documents. The study shows that the models do not manage to perform as well on multi-hop queries as on queries whose answers can be found in a single document.
When utilizing RAG frameworks for real-world applications, the assumption that the response to a query may need to be retrieved from multiple sources should be addressed, both for optimizing accuracy and for ethical reasons. By introducing the MultiHop-RAG benchmark, the authors show where the current RAG framework contains flaws and that there is room for further development [36]. Another recent framework, developed with the purpose of further advancing the usage of RAG, is Retrieval-Augmented Planning (RAP). This framework is based on the core structure of the RAG framework and contains a memory. The memory allows the model to draw on past experiences, which are retrieved and utilized for generating a response to the current query. By allowing an LLM to base its responses both on the provided context and on context from past experiences, the framework advances its planning and decision-making capability. These advancements can be utilized to guide the user, based on the query, through a set of substeps to reach the goal. The RAP framework has been shown to produce state-of-the-art performance when integrated with text-only modality LLMs. When integrated with an LMM, the framework improves the performance slightly, but there is room for improvement [21].

3 Methodology

This chapter describes the methods used to build and evaluate the Text-only and the Multimodal RAG frameworks. First, sections 3.1 to 3.3 give an overview of the data that is used, detailing how it is extracted and processed from raw PDF files into a format suitable for input to the RAG frameworks. Next, the consecutive steps of the RAG frameworks' implementations are described separately for the two frameworks. Section 3.4 focuses on the Text-only RAG, while section 3.5 addresses the Multimodal RAG. Both sections consist of subsections describing first the embedding and retrieval processes, followed by the integration of an LLM or LMM.
For the Text-only RAG, an additional subsection describes the process of generating summaries, which is unique to this framework. At the end, a brief overview of the evaluation methodology is presented in section 3.6. A detailed explanation of specific experiments is included in the next chapter, Experiments.

3.1 Data

The baseline dataset used in this project is the PM209 open-source dataset containing digital product manuals from well-known consumer electronics brands. The dataset is chosen since the structure, length, and nature of the data in the manuals resemble the characteristics of Wiretronic's assembly manuals. The dataset consists of 22,021 QA pairs from 209 product manuals among 27 consumer electronics brands. Each question is in text format and has a corresponding multimodal answer that includes text and related visual regions from the manuals. The dataset is diverse, with the manuals being 10 to 500 pages long and covering various subjects from more than 90 different product categories. The question-answer pairs were designed to emphasize the multimodal content in product manuals and to support the VQA task. Due to computational resource and time limitations, a subset of 10 manuals that resemble Wiretronic's assembly manuals the most is used. An example page of one of the used manuals can be seen in A.4.

3.2 Data extraction

The first step of building the RAG frameworks is to choose the method that extracts raw data from the documents. Given the domain-specific application of this project, all of the data formats, including text blocks, images, and tables, are relevant for the context. The extraction phase is similar across both of the RAG frameworks, where different techniques are used for image extraction and text extraction. To extract images from the documents, the module partition_pdf is used. This module belongs to the Unstructured library, which specializes in processing raw and unstructured data in documents.
Based on the task, the document data can be treated and grouped by different chunking strategies; in this case, the data is grouped by the 'by_title' strategy. All of the images are extracted with this strategy and saved to an output directory. To extract the text data from the documents, the PyMuPDF library is used. Since the dataset contains a lot of text in images, it is necessary to extract both text blocks in natural text format and text blocks in image format. To achieve this, PyMuPDF is used to extract images for the specific purpose of extracting the text within them. These images are therefore not saved or used for any other purpose. This procedure ensures that all context is captured, to minimize information loss. Meanwhile, the images extracted by partition_pdf are the ones used for further processing. To extract the text in each image, Optical Character Recognition (OCR) is used. Pytesseract is an OCR method that uses an LSTM to translate the text in images into machine-readable characters. The text data and the text extracted from the image data are stored separately in dictionaries, where the key indicates whether the text originally appeared as "text" or as part of an "image" in the document. Tables are extracted using the Tabula library, which identifies table structures in the documents and stores them as DataFrame objects.

3.3 Chunking

After the data extraction phase, the chunking phase takes place. The procedure involves splitting the extracted text data into blocks of varying sizes and is repeated once for each chosen chunk size. The chosen chunk sizes are 64, 128, 256, 512, and 1024 tokens. This is done for both of the RAG frameworks. To specify the chunk size, the TokenTextSplitter module from the LangChain library is used. The data extraction phase, including chunking, is illustrated in Figure 3.1.
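The sweep over the five chunk sizes can be sketched as follows. A naive character-level splitter stands in for LangChain's TokenTextSplitter so that the example stays self-contained; `split_fixed` and the placeholder document are illustrative, not the thesis code.

```python
CHUNK_SIZES = [64, 128, 256, 512, 1024]

def split_fixed(text, size):
    """Naive fixed-size splitter standing in for a token-based text splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

document = "x" * 2048  # placeholder for the extracted manual text
chunks_by_size = {s: split_fixed(document, s) for s in CHUNK_SIZES}
# Smaller sizes produce proportionally more chunks: 32, 16, 8, 4, and 2 here.
```

Each of the five resulting chunk sets then feeds its own copy of the downstream embedding and retrieval pipeline, which is why all further procedures run five times.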
Once the data extraction and chunking parts are done, the further structures of the two RAG frameworks differ.

Figure 3.1: Simplified chunking process schema.

In the next section, the outline of the Text-only RAG framework is presented, followed by the outline of the Multimodal RAG framework. All further procedures for both of the RAG frameworks take place five times each, once for the data belonging to each chunk size.

3.4 Text-only RAG

In this section, the different steps of the implementation of the Text-only RAG framework are presented. The first step, generating summaries, takes place right after the data extraction phase. It is followed by the embedding and retrieval phase, and finally, LLaMA2-7B is integrated with the framework.

3.4.1 Generating summaries

When the extraction phase is done, all of the text, image, and table elements are stored in dictionaries in a specified output directory. The next step is to generate and save summaries of the extracted data. By doing this, the different modalities of the extracted data are flattened and represented as a text-only modality. To summarize the data, either an LLM or an LMM is employed, depending on whether the data is textual or visual. In this project, all of the LLMs and LMMs are run through ChatOllama, which allows for running the open-source models locally. This approach is chosen because a potential internal extension of the project could involve sensitive company data, and local execution ensures a safe way to process data without it being exposed externally. To summarize the text chunks and tables, ChatOllama's LLaMA2-7B is used. Each chunk, together with the ChatOllama model as a parameter, is passed into LangChain's summarization chain, where the summaries are created based on the map-reduce method.
This technique is used for summarizing larger documents: it splits them into smaller blocks and summarizes each separately in the map step, and then combines these summaries into the final summary in the reduce step. The final summaries are saved into separate text files. Since LLaMA2-7B is only capable of processing text data, it cannot be used for generating image summaries. Instead, ChatOllama's LLaVA-7B is chosen to summarize the extracted images. Here, instead of using LangChain's summarization chain, a specified prompt that tells LLaVA-7B to make detailed summaries of the images is defined. It is formulated as follows: "You are an assistant whose task is to describe images for developing a Visual Question Answering tool. Provide a comprehensive description of the image, including all relevant details and elements like graphs, charts, diagrams, or textual information. Describe any notable features or patterns observed. Ensure that the description is clear, detailed, and covers all aspects of the image to facilitate understanding it." The running procedure of LLaVA-7B is defined in a separate script, which together with the prompt is iterated over each extracted image, saving the generated summary into a separate text file. In the end, all of the text files with summaries undergo a cleaning process, where empty lines are deleted.

3.4.2 Embedding and Retrieval

Once all document data is represented as text summaries, the embedding takes place. The first step is to create storage for the embedded vectors and for the raw data elements, by utilizing a vector store and an in-memory document store. The vector store is created with Chroma, a module from the LangChain library, which takes the chosen embedding model as a parameter. The chosen embedding model is Sentence Transformers from HuggingFaceEmbeddings. This model converts all the text summaries into vector representations, which are then stored in the vector store.
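The summarize-embed-retrieve idea can be illustrated with a toy bag-of-words "embedding" in place of Sentence Transformers. The store layout mirrors the parent-child structure described here (embedded summaries as children, raw elements as parents, linked by shared IDs), but all names and data in the sketch are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a sentence embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Child nodes: embedded summaries, keyed by a shared ID.
vector_store = {
    "doc-1": embed("summary: inserting the battery pack into the device"),
    "doc-2": embed("summary: cleaning the lens with a soft cloth"),
}
# Parent nodes: the raw extracted elements the summaries point back to.
doc_store = {
    "doc-1": "<raw text chunk about battery insertion>",
    "doc-2": "<raw text chunk about lens cleaning>",
}

def retrieve(query, k=1):
    """Rank summaries by similarity to the query, then return the raw parents."""
    q = embed(query)
    ranked = sorted(vector_store, key=lambda i: cosine(q, vector_store[i]), reverse=True)
    return [doc_store[i] for i in ranked[:k]]
```

The key design point, which carries over to the real MultiVectorRetriever setup, is that similarity search runs over the compact summaries while the generator receives the richer raw elements they point to.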
The document store is created with InMemoryStore from the LangChain library. The purpose of creating this storage is to keep track of the connection between raw data elements and their corresponding embedded summaries. The raw data elements serve as the parent nodes in the document store, and the corresponding embedded summaries serve as the child nodes in the vector store. For the retrieval phase, a MultiVectorRetriever, a module from the LangChain library, is created. The retriever integrates the document store and the vector store by indexing: each raw data element and its embedded summary are assigned a unique ID, which is crucial for the subsequent retrieval stage. During retrieval, the MultiVectorRetriever computes a similarity search between the embeddings stored in the vector store and the embedded user query. Then, the documents that have the highest semantic similarity to the query are identified. The retriever's search parameter can be configured manually, which sets the number of retrieved chunks, the top-k value.

3.4.3 Integrating LLaMA2-7B

Once all of the extracted, summarized, and embedded data is stored, an instance of ChatOllama's LLaMA2-7B is integrated with the retriever. At this stage, the ChatOllama model is the central processing unit; it is joined with the previous structure, and the final QA pipeline is created. The pipeline is constructed as a chain, consisting of a context, a question, a defined prompt, and LLaMA2-7B, together with the LangChain_Core modules RunnablePassthrough, StrOutputParser, and PromptTemplate. These modules help to construct the pipeline. Before these parameters are passed into the chain, a prompt template is created using the PromptTemplate module, which enables the interaction within the pipeline. This template is used to construct the prompt that goes into the chain.
In the template, the prompted message is formulated as follows:

"Answer the question based only on the following context, which can include text and tables."

The chain also takes a context and a question argument, together with the template transformed into a prompt variable. In the chain, the context is acquired from the retriever, and the question is defined as a RunnablePassthrough which can be filled out by the user. When the chain is created, a user query can invoke it, and the different stages of the RAG framework are then executed. A response to the question is generated and presented as the final output with the use of StrOutputParser.

3.5 Multimodal RAG

In this section, the different steps of the implementation of the Multimodal RAG framework are presented. The first step, embedding and retrieval, takes place right after the data extraction phase, presented in section 3.2. The embedding and retrieval phase is followed by the stage where LLaVA-7B is integrated with the framework.

3.5.1 Embedding and Retrieval

After extracting all the tables, text, and images from the raw PDF files, the embedding and retrieval process takes place. In the Multimodal RAG framework, images are embedded into the vector store alongside textual data, which results in a unified, multimodal vector store. For this purpose, the Chroma vector store and the OpenCLIPEmbedding model are used. This embedding model is an open-source implementation of OpenAI's CLIP published in [24], which has been pre-trained on a variety of image-text pairs. It uses a contrastive learning approach to map images and text to a common embedding space. The Chroma vector store keeps these embeddings in memory, organizing them in a structured database with ID keys for each document. Chroma's function 'add_images' stores images as base64-encoded strings so that they can be passed to an LMM like LLaVA-7B.
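The Text-only chain described above (retriever, prompt template, model, output parser) amounts to function composition, which can be mimicked in plain Python. The `retriever` and `llm` below are hypothetical stubs, not the actual LangChain objects; only the order of the stages matches the pipeline in the text.

```python
TEMPLATE = ("Answer the question based only on the following context, "
            "which can include text and tables.\n"
            "Context: {context}\nQuestion: {question}")

def make_chain(retriever, llm, parse=str.strip):
    """Compose retrieval, prompt construction, generation and parsing."""
    def invoke(question):
        context = " ".join(retriever(question))       # retrieval stage
        prompt = TEMPLATE.format(context=context,     # prompt template
                                 question=question)
        return parse(llm(prompt))                     # generation + parsing
    return invoke

# Stub components for illustration
chain = make_chain(retriever=lambda q: ["manual excerpt"],
                   llm=lambda p: "  The answer.  ")
answer = chain("How do I charge the device?")
```

In LangChain the same composition is written declaratively with RunnablePassthrough, PromptTemplate and StrOutputParser; the control flow is equivalent.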
Similarly to the Text-only RAG framework, a document store is created alongside the vector store; it stores raw textual data and image metadata and is connected to the vector store by a unique ID. For the retrieval phase, LangChain's MultiVectorRetriever is initialized to handle multiple vectors, i.e. text and image embeddings. Then, the vector store is converted to this retriever instance with a manually specified search parameter, indicating the top-k value of retrieved chunks. As in the Text-only RAG framework, the retriever uses semantic similarity search to match the user's query with stored vector embeddings and finally retrieve the original context.

3.5.2 Integrating LLaVA-7B

To integrate ChatOllama's LLaVA-7B with the retriever, a prompt function that formats all the retrieved context into a single string is created. An additional message for LLaVA-7B, which this function appends at the end of the prompt, is formulated as follows:

"Provide a precise answer to the user question based on the provided context."

If there are images in the retrieved context, this function creates a message containing an image URL. Then, the user question and formatted context texts are stored in another text message. In the end, the generated messages are returned from the function to the prompt as a HumanMessage LangChain_Core object. As in the Text-only RAG framework, LLaVA-7B and the final prompt are constructed as a chain. The chain uses the retriever to get the context from the documents, together with the following modules: RunnablePassthrough to input user questions, the aforementioned prompt function to construct the prompt for LLaVA-7B, and a StrOutputParser that outputs the generated answer. When in use, the RAG chain can be invoked with a user's query; the context data is then retrieved and integrated into the prompt, which is passed into LLaVA-7B.
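The base64 encoding of images and the split between image and text messages can be sketched with the standard library alone. The message layout below only approximates the structure of a HumanMessage and is an assumption for illustration; `encode_image` and `build_messages` are hypothetical names.

```python
import base64

def encode_image(image_bytes):
    """Store an image as a base64 string, as Chroma's add_images does."""
    return base64.b64encode(image_bytes).decode("ascii")

def build_messages(question, context_texts, context_images=()):
    """Format retrieved context into prompt messages for a multimodal model."""
    content = []
    # Images become image-URL messages built from the base64 strings
    for img_b64 in context_images:
        content.append({"type": "image_url",
                        "image_url": f"data:image/jpeg;base64,{img_b64}"})
    # Text message: formatted context, the user question, and the instruction
    content.append({"type": "text",
                    "text": " ".join(context_texts)
                    + f"\nQuestion: {question}\n"
                    "Provide a precise answer to the user question "
                    "based on the provided context."})
    return content

msgs = build_messages("What is shown in figure 2?",
                      ["caption text"], [encode_image(b"\xff\xd8fake")])
```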
LLaVA-7B interprets and analyzes the retrieved multimodal information, generating responses to user queries.

3.6 Evaluation

To assess the performance of the Text-only and Multimodal RAG frameworks and optimize their parameters, the BERTScore and LangSmith evaluation libraries are employed. Since ground-truth answers are available in this project, BERTScore is used as a foundational metric for assessing the semantic accuracy of the generated answers. This metric outputs its own precision, recall and F1 scores. In addition, the LangSmith library is utilized, from which the coherence, relevance, and contextual accuracy metrics are selected for the project. While BERTScore is a standard metric for evaluating RAG systems, LangSmith's metrics use the LLM-As-A-Judge approach, which complements traditional metrics with human-like reasoning for texts with language nuances and deep contextual understanding. Details on how these metrics function are discussed in Section 2.1.6. Additionally, a manual qualitative evaluation of selected answers is performed. The specifics of the experiments performed and the parameters tested are described in the next chapter.

4 Experiments

The performances of the Text-only and Multimodal RAG frameworks are evaluated by quantitative and qualitative analysis. The base version of the RAG frameworks used in the experiments consists of chunk size 256 and top-k value 4. In the quantitative part, one component, either the chunk size or the top-k value, is changed at a time, and the RAG framework's performance is evaluated. Since the current state of evaluation of RAG frameworks focuses mainly on the LLM component in the RAG pipeline [15], it is decided that the experiments in this project focus on evaluating the other crucial part of the framework: the retrieval stage.
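This one-parameter-at-a-time design can be written as a small driver loop. `evaluate` is a hypothetical function standing in for running the full RAG pipeline and scoring its answers; the parameter grids and base values match those used in the experiments.

```python
BASE = {"chunk_size": 256, "top_k": 4}  # base configuration of the frameworks
GRID = {"chunk_size": [64, 128, 256, 512, 1024],
        "top_k": [2, 4, 6, 8]}

def sweep(evaluate):
    """Vary one parameter at a time, holding the other at its base value."""
    results = {}
    for param, values in GRID.items():
        for v in values:
            config = dict(BASE, **{param: v})   # override only one parameter
            results[(param, v)] = evaluate(config)
    return results

# Stub evaluator for illustration: simply echoes the configuration it received
scores = sweep(lambda cfg: cfg)
```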
Therefore, the effect of two key parameters of the retriever, chunk size and top-k value, is evaluated on two different sets of scores: BERTScore and LangSmith metrics. BERTScore is used to evaluate the generation part of the framework, by calculating the similarity of embeddings between the generated and ground-truth answers. It consists of three metrics named similarly to the conventional classification metrics, namely recall, precision, and F1 score. It is important to note that the BERT metrics differ from the standard classification metrics, although they have a somewhat similar interpretation, which is explained in section 2.1.6.1. To avoid confusion, from here on they are referred to as B-recall, B-precision, and B-F1, respectively. The LangSmith metrics are used to enhance the evaluation by utilizing an LLM-As-A-Judge approach. The calculated metrics from this framework include coherence, relevance, and contextual accuracy. These scores are chosen to conduct the most comprehensive evaluation of the different configurations of the RAG frameworks.

For the qualitative analysis, answers generated by the Text-only and the Multimodal RAG frameworks are compared to each other as well as to answers generated by LLaVA-7B, which serves as a baseline model in this part of the analysis. The same questions are asked to all three models and are categorized into different complexities. This analysis is performed to illustrate how the models understand the retrieved context and match the query when the questions require progressively more comprehension of the context. These responses are analyzed qualitatively to investigate a possible correlation between the resolution of the question and the quality of the response.

4.1 Quantitative analysis

The quantitative evaluation is run on a total of 100 questions per chunk size, combining 10 questions per manual for 10 different manuals.
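How B-precision and B-recall relate can be illustrated with a simplified greedy-matching computation. Real BERTScore matches contextual BERT embeddings; the sketch below substitutes a toy per-token similarity (exact match) and is only meant to show the structure of the metric, not its actual values.

```python
def bert_style_scores(candidate, reference, sim):
    """Greedy token matching: precision averages over candidate tokens,
    recall over reference tokens, F1 is their harmonic mean."""
    precision = sum(max(sim(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    recall = sum(max(sim(c, r) for c in candidate)
                 for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy similarity: 1.0 for identical tokens, else 0.0 (exact-match stand-in
# for the cosine similarity of contextual embeddings)
exact = lambda a, b: 1.0 if a == b else 0.0
p, r, f1 = bert_style_scores(["press", "the", "power", "button"],
                             ["hold", "the", "power", "button"], exact)
```

With the exact-match stand-in, three of four tokens match on each side, so all three scores come out as 0.75; with real embeddings, near-synonyms would also contribute partial credit.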
The performances are measured by changing one parameter at a time, either the chunk size or the top-k value. The chunk size evaluation is run for five different chunk sizes: 64, 128, 256, 512, and 1024. The top-k evaluation is run for k values of 2, 4, 6, and 8. During the chunk size experiments, the default value of top-k is set to 4, and during the top-k experiments, the default chunk size is set to 256. The results are presented separately for the two RAG frameworks and also separately for the two evaluators, BERTScore and LangSmith.

4.2 Qualitative analysis

In the qualitative analysis, the output answers obtained by the Text-only RAG and Multimodal RAG are investigated and compared to the baseline model, LLaVA-7B. The analysis is performed on three different manuals from the dataset, with questions of 4 levels of complexity. The complexity of a user query is determined based on the location and modality of the data needed to answer it. The four levels of complexity in user queries are the following:

1. Top page: Question with text answer located at the top of the page.
2. Middle page: Question with text answer located in the middle of the page.
3. Scattered: Question with text answer split between multiple OCR regions.
4. Multimodal: Question requiring understanding of both text and image data.

For each level of complexity, there are three questions that are fed to the models, one from each manual. The Top page complexity is designed to test the models' ability to interpret information that is easily accessible and does not require navigating through dense or overlapping data. The second level of complexity, the Middle page, is more focused and requires more precise information extraction abilities. It tests the models' ability to locate and interpret valuable content from dense paragraphs, located in the middle of the page among surrounding text.
Advancing in complexity, the Scattered level challenges the models on their ability to integrate information spread across multiple paragraphs or sections. It tests whether the models can maintain integrity and coherence when the important context is scattered. While the described three levels of complexity require processing only text data, the last level adds another layer to analyze: visual content. The Multimodal complexity is designed to evaluate the highest resolution of answer generation, which integrates both textual and visual cues to generate an answer.

The three manuals are fed into the Text-only RAG and the Multimodal RAG framework in order for them to generate responses to the questions. LLaVA-7B, however, is not capable of taking a document as an input. It is therefore decided that an image of only the region of the page containing the correct answer is used as an input, along with the corresponding question. Each level progressively builds on the previous one to identify patterns and to distinguish the performance of Text-only RAG, Multimodal RAG, and baseline LLaVA-7B on increasingly complex retrieval-augmented generation tasks.

5 Results

In this section, the results of the quantitative and qualitative analysis of the Text-only and Multimodal RAG frameworks are presented. The setup of these experiments is reported in section 4.

5.1 Quantitative analysis

In the first part of the quantitative analysis, the results of the chunk size experiments are presented, followed by the second part where the results of the top-k experiments are presented.

5.1.1 Evaluating chunk sizes

Below, the values of the BERTScore and LangSmith metrics are reported for the Text-only and Multimodal RAG frameworks across the varying chunk sizes 64, 128, 256, 512, and 1024.
5.1.1.1 BERTScore for Text-only RAG

Table 5.1 and figure 5.1 show the B-precision, B-recall, and B-F1 scores for the Text-only RAG framework for the different chunk sizes.

Chunk size   B-precision   B-recall   B-F1
64           0.826         0.872      0.848
128          0.825         0.873      0.848
256          0.827         0.874      0.850
512          0.824         0.876      0.849
1024         0.827         0.877      0.851

Table 5.1: Table of BERTScore for Text-only RAG for different chunk sizes.

Figure 5.1: Plot of BERTScore for Text-only RAG for different chunk sizes.

As can be seen, the scores for all metrics do not change significantly across the different chunk sizes but remain rather stable with only minor variations. B-recall tends to increase slightly as the chunk size grows, achieving the best performance at chunk size 1024. B-precision, on the other hand, follows no clear trend; it achieves the highest performance at chunk sizes 256 and 1024. All B-F1 scores are relatively consistent over the various chunk sizes, with only minor variations. Therefore, it can be concluded that any of the three chunk sizes 256, 512, and 1024 is close to optimal for the Text-only RAG, as they strike a balance between B-precision and B-recall. These findings also demonstrate that the Text-only RAG has a stable performance, since it handles different chunk sizes without significant loss in either B-precision or B-recall.

5.1.1.2 BERTScore for Multimodal RAG

Table 5.2 and figure 5.2 show the B-precision, B-recall, and B-F1 scores for the Multimodal RAG framework for the different chunk sizes.

Chunk size   B-precision   B-recall   B-F1
64           0.847         0.891      0.868
128          0.850         0.895      0.872
256          0.841         0.887      0.863
512          0.648         0.689      0.667
1024         0.563         0.598      0.580

Table 5.2: Table of BERTScore for Multimodal RAG for different chunk sizes.

Figure 5.2: Plot of BERTScore for Multimodal RAG for different chunk sizes.

It can be seen that all the BERTScore metrics follow a similar trend.
They increase slightly between chunk sizes 64 and 128, peaking at chunk size 128. For chunk sizes above 128, all three metrics decrease. In particular, for chunk sizes 512 and 1024, there is a drastic drop in B-precision, B-recall, and B-F1. These findings suggest that smaller chunk sizes, with 128 being optimal, are more effective for precise answer generation with the Multimodal RAG. Additionally, since the Multimodal RAG cannot handle large contexts without an overall loss in performance, this becomes its limitation compared to the Text-only RAG.

5.1.1.3 LangSmith for Text-only RAG

Table 5.3 and figure 5.3 show the scores of coherence, contextual accuracy, and relevance for the Text-only RAG for the different chunk sizes.

Chunk size   Coherence   Contextual Accuracy   Relevance
64           0.820       0.480                 0.690
128          0.770       0.550                 0.700
256          0.920       0.520                 0.740
512          0.850       0.480                 0.810
1024         0.840       0.560                 0.680

Table 5.3: Table of LangSmith scores for Text-only RAG for different chunk sizes.

Figure 5.3: Plot of LangSmith scores for Text-only RAG for different chunk sizes.

For all the metrics, the LangSmith scores are inconsistent across the chunk sizes, and no clear trend can be observed. For coherence, there is a relatively significant increase between chunk sizes 128 and 256. For the two chunk sizes greater than 256, the coherence scores become more stable, although it is not known what happens beyond chunk size 1024. For relevance, a maximum is reached at chunk size 512. The contextual accuracy scores are inconsistent across the chunk sizes: the scores for sizes 64 and 512 are similarly low, around 0.48, and the scores for sizes 128 and 1024 are both close to 0.56. No trend can be discerned there due to the inconsistent scores. However, it can be seen that contextual accuracy behaves in the opposite way to relevance.
When one increases between two chunk sizes, the other one decreases. This can imply a trade-off between the inclusion of broader context and direct relevance to specific queries. The optimal chunk size in this case would be a mid-range one, like 256 or 512, since these provide a reasonable balance across all metrics.

5.1.1.4 LangSmith for Multimodal RAG

Table 5.4 and figure 5.4 show the scores of coherence, contextual accuracy, and relevance for the Multimodal RAG framework for the different chunk sizes.

Chunk size   Coherence   Contextual Accuracy   Relevance
64           0.790       0.360                 0.770
128          0.750       0.490                 0.690
256          0.870       0.370                 0.730
512          0.780       0.070                 0.660
1024         0.740       0.230                 0.520

Table 5.4: Table of LangSmith scores for Multimodal RAG for different chunk sizes.

Figure 5.4: Plot of LangSmith scores for Multimodal RAG for different chunk sizes.

It can be noticed that the scores for relevance and coherence seem to follow a similar trend, where a slight decrease between chunk sizes 64 and 128 is followed by a slight increase between chunk sizes 128 and 256. For chunk sizes 512 and 1024, the scores are slightly lower than for 256. For contextual accuracy, the difference between the highest and the lowest score is highly significant and appears between chunk sizes 128 and 512, which give the scores 0.49 and 0.07 respectively. Since the score for 512 is close to 0, it is the least desirable configuration of the multimodal model. It can be observed that for chunk sizes larger than 256, none of the metrics tops earlier scores. Additionally, a trend similar to the one observed for the LangSmith metrics of the Text-only RAG can be spotted: contextual accuracy increases after dropping at chunk size 512, while the two other metrics decrease at the end of the chunk size axis. Moreover, what stands out are the notably lower contextual accuracy scores for the Multimodal RAG compared to the Text-only RAG.
In the case of the Multimodal RAG, the optimal chunk size according to LangSmith seems to be either 128, maximizing contextual accuracy, or 256, maximizing coherence, without the other metrics dropping drastically.

5.1.2 Evaluating top-k values

Below, the BERTScore and LangSmith metrics are presented for the Text-only and Multimodal RAG frameworks across the top-k values 2, 4, 6 and 8.

5.1.2.1 BERTScore for Text-only RAG

Table 5.5 and figure 5.5 show the B-precision, B-recall, and B-F1 scores for the Text-only RAG framework across the different top-k values.

Top-k   B-precision   B-recall   B-F1
2       0.825         0.874      0.849
4       0.827         0.874      0.850
6       0.823         0.873      0.847
8       0.819         0.870      0.843

Table 5.5: Table of BERTScores for Text-only RAG for different top-k values.

Figure 5.5: Plot of BERTScores for Text-only RAG for different top-k values.

It can be noticed that all three metrics, namely B-precision, B-recall, and B-F1, follow a similar pattern. At first, their scores increase slightly, up to top-k value 4. After top-k value 4, however, all three metrics decrease for top-k values 6 and 8, with a minimum reached at top-k value 8. These findings imply that there is a short initial improvement as more chunks are retrieved, but larger top-k values can introduce noise into the retrieved context, which confuses the model in the generation stage. The optimal top-k value for the Text-only RAG according to BERTScore thus seems to be 4, since all the metrics reach their maximum performance at this point. However, the changes in the scores across the different top-k values are minimal, which makes it difficult to state this conclusively. These minor variations also show that the Text-only RAG framework is rather stable and can maintain a good performance across different top-k values, as was observed for the different chunk sizes.
5.1.2.2 BERTScore for Multimodal RAG

Table 5.6 and figure 5.6 show the B-precision, B-recall, and B-F1 scores for the Multimodal RAG framework across the different top-k values.

Top-k   B-precision   B-recall   B-F1
2       0.812         0.861      0.835
4       0.841         0.887      0.863
6       0.845         0.888      0.866
8       0.714         0.754      0.733

Table 5.6: Table of BERTScores for Multimodal RAG for different top-k values.

Figure 5.6: Plot of BERTScores for Multimodal RAG for different top-k values.

As can be seen, the scores follow a trend similar to that of the Text-only RAG above. At first, there is a slight increase in the scores as the top-k value rises from 2 to 6, and at top-k value 8 there is a drastic drop in performance. As for the Text-only RAG, the reason behind this is probably the introduction of irrelevant information into the context, confusing the model. However, for the Multimodal RAG the performance decrease at the largest top-k value is more drastic. The most optimal top-k value in this case seems to be 6, although its performance is only slightly better than at top-k value 4. On the other hand, the threshold at which additional chunks introduce noise is higher for the Multimodal RAG, occurring at top-k value 6, while the performance drop is also more drastic than for the Text-only RAG.

5.1.2.3 LangSmith for Text-only RAG

Table 5.7 and figure 5.7 show the LangSmith scores for the Text-only RAG framework across the different top-k values.

Top-k   Coherence   Contextual Accuracy   Relevance
2       0.960       0.520                 0.720
4       0.920       0.520                 0.740
6       0.810       0.560                 0.760
8       0.850       0.440                 0.680

Table 5.7: Table of LangSmith scores for Text-only RAG for different top-k values.

Figure 5.7: Plot of LangSmith scores for Text-only RAG for different top-k values.

What can be seen is that relevance and contextual accuracy follow a similar pattern, slightly increasing up to top-k value 6 and then decreasing at top-k value 8.
They both peak at top-k value 6. On the other hand, coherence exhibits an almost exactly reverse pattern to contextual accuracy. It peaks at top-k value 2, then drops towards top-k value 6 and slightly recovers at top-k value 8. These results suggest that the optimal top-k value for the Text-only RAG according to LangSmith lies between 4 and 6; all three metrics achieve relatively high scores for these k values. Based on the increasing trend in relevance and contextual accuracy up to top-k value 6, it can also be observed that the model benefits from additional chunks up to this point. However, coherence acts in the opposite way: while relevance and contextual accuracy benefit from more context up to a point, coherence is best maintained with fewer chunks.

5.1.2.4 LangSmith for Multimodal RAG

Table 5.8 and figure 5.8 show the LangSmith scores for the Multimodal RAG framework across the different top-k values.

Top-k   Coherence   Contextual Accuracy   Relevance
2       0.900       0.340                 0.720
4       0.870       0.370                 0.730
6       0.880       0.330                 0.680
8       0.850       0.190                 0.690

Table 5.8: Table of LangSmith scores for Multimodal RAG for different top-k values.

Figure 5.8: Plot of LangSmith scores for Multimodal RAG for different top-k values.

It can be observed that the coherence scores are fairly stable across the different top-k values. Coherence peaks at top-k value 2, similar to the Text-only RAG. The relevance scores are also relatively stable, with only a slightly distinguishable peak at top-k value 4. For contextual accuracy, however, the peak and minimum are easier to distinguish. Like relevance, contextual accuracy exhibits a slight maximum at top-k value 4, while a minimum is reached at top-k value 8.
It can also be noticed that, according to the LangSmith metrics, the scores for the Multimodal RAG are generally lower at almost every point compared to the Text-only RAG, with contextual accuracy being most notably lower. Moreover, adding more chunks with higher top-k values improves contextual accuracy and relevance only at the very beginning, up to top-k value 4. At top-k value 8, contextual accuracy decreases significantly, while the other metrics remain somewhat stable but low across all values.

5.2 Qualitative analysis

In this section, questions of varying complexity and their ground-truth answers are presented together with the generated responses of the baseline model LLaVA-7B, the Text-only RAG, and the Multimodal RAG. This qualitative assessment is done to test how the RAG frameworks understand the retrieved context and match the query when the questions require progressively more comprehension of the context. Three manuals for different products are chosen for this analysis: a Dell cellphone manual, a Samsung vacuum cleaner manual, and a Sony laptop manual. From here on, these manuals are referred to as Manual 1, Manual 2, and Manual 3. The manuals are chosen because of their structural resemblance to Wiretronic's product manuals. One question from each complexity category (Top page, Middle page, Scattered, and Multimodal) is chosen for each manual. For each manual, the four questions of different complexity are asked to LLaVA-7B, the Text-only RAG, and the Multimodal RAG. The responses from the three models are presented for each manual in the Appendix.

5.2.1 Manual 1

Table A.1 shows the generated responses of LLaVA-7B, the Text-only RAG, and the Multimodal RAG framework to the four questions of different complexity for the Dell cellphone manual. What can be observed is that, in general, LLaVA-7B did not perform well on the questions about Manual 1. All generated responses consist of hallucinations and information that appears to be assumptions.
Since LLaVA-7B is trained on a large corpus of data, these guesses are probably generalizations of data from contexts other than the one in the manual. This can particularly be seen in the responses to the questions in the categories Top page and Scattered. The response to the question in the category Middle page, however, seems to contain somewhat correct information. That information, though, is not as detailed as in the manual, so it is hard to tell whether the response is based on retrieved context from the manual or also on general training data. For the question in the category Multimodal, the response is a complete hallucination; none of the information in the response can be found in the manual.

For the Text-only RAG framework, the responses to the different questions vary in correctness. The generated responses to the questions in the categories Top page and Middle page are both somewhat true. They are general and do not contain many details, which makes it hard to tell whether their correctness stems from prior training knowledge or from the regions with the ground-truth answers actually being retrieved. For the question in the category Scattered, the response is a general description and does not contain any specific details from the manual. This implies that, in this case, the Text-only RAG framework did not manage to retrieve information from any region of the manual. The same trend can be seen for the response to the question in the category Multimodal: the response is too general and not based on information actually retrieved from the manual.

The Multimodal RAG framework performs slightly better and gives more accurate answers than the Text-only RAG framework and baseline LLaVA-7B. Even though the response to the question in the category Top page is incorrect, it is based on actual information in the manual, located in a region other than the ground truth.
For the question in the Middle page category, the response does not seem incorrect. However, it is not based on actual information from the manual; it seems to be based on data used to train LLaVA-7B. Even so, the response is not considered a hallucination, since it is not incorrect. The response to the question in the category Scattered is correct and contains additional information relevant to the question. The same pattern is seen for the Multimodal question: the response is accurate, since it captures a lot of relevant information from the correct region, and is supported by additional information from the manual.

5.2.2 Manual 2

Table A.2 shows the generated responses of LLaVA-7B, the Text-only RAG, and the Multimodal RAG framework to the four questions of different complexity for the Samsung vacuum cleaner manual.

The quality of the responses generated by LLaVA-7B varies across the question complexities. For the question in the category Top page, it manages to generate a very accurate response with no unnecessary additional information. For the question in the category Middle page, it generates a somewhat accurate response. Some parts of the response align with the ground truth, while others seem not to be retrieved from the manual. There is also additional information from the manual included in this response, but it is not directly associated with the question and hence not needed. The responses to the questions in the Scattered and Multimodal categories seem to be complete hallucinations. The given information cannot be found in the manual, which also indicates that it could stem from the general data used to train LLaVA-7B.

The Text-only RAG framework performs well in generating accurate responses that are similar to the ground-truth answers for this manual. For the Top page category, the response is completely aligned with the ground truth.
The response also contains additional safety information associated with the correct answer, which makes this response even more valuable than the ground truth. For the Middle page category, the generated response contains somewhat correct information along with other information that is not relevant to the question; the irrelevant parts seem to be completely hallucinated. The response to the question in the category Scattered follows a similar trend. The first sentence of this response is correct but is followed by a long instruction that is not related to the question and seems to be retrieved from elsewhere in the manual. For the last question, in the category Multimodal, the generated response contains hallucinated information. As it refers to specific names of the parts that the question concerns, it may seem true, but reading the manual makes clear that the information is false. Some instructions in this response, however, are relevant.

The Multimodal RAG framework does not seem to perform as well as the Text-only RAG framework on questions for this manual. For the Top page category question, the generated response is completely incorrect. However, the response is retrieved from the same region as the one where the ground-truth answer is located. This implies that the context of the response is somewhat similar to the ground-truth answer, even if the details are incorrect. The response to the question in the category Middle page is also incorrect. Here again, the information in the response is not a hallucination but actual information from elsewhere in the manual. The information from the answer can be found on another page, which implies that the retriever part of this framework failed to capture the correct context. For the question in the category Scattered, the generated response seems to be incorrect too.
It is not clear whether the response contains information from another region of the manual or is a hallucination. Since the response is inconsistently written, it could be a completely hallucinated answer. Finally, for the question in the category Multimodal, the generated response is similar to the response generated by the Text-only RAG framework. It does not provide a precise answer but gives somewhat accurate instructions containing relevant details.

5.2.3 Manual 3

Table A.3 shows the generated responses of LLaVA-7B, the Text-only RAG, and the Multimodal RAG framework to the four questions of different complexity for the Sony laptop manual.

For LLaVA-7B, it can be seen that, overall, it manages to capture somewhat correct information for the questions of different complexity. However, a large part of each generated answer contains information that is in the manual but is not related to the question, and some hallucinations also occur. For example, the answer to the question from the category Top page contains actual words from the manual, but the sequence of instructions is false. Next, the answer to the question in the Middle page category is somewhat true but contains a lot of information that is not relevant to the question and could be hallucinated. For the question in the Scattered category, the answer is not wrong but does not contain any relevant details and is unnecessarily long. It is a general description of how to use the product, which makes it hard to tell whether the content is retrieved from the manual or based on other general data. For the question in the category Multimodal, it is clear that the response is not based on content in the manual. This response is probably based on data used to train LLaVA-7B, a problem that also appeared when evaluating LLaVA-7B's responses for the other two manuals.
The responses generated by the Text-only RAG framework vary in quality across questions of different complexity. In general, the responses are not very accurate with respect to the ground-truth answers for this manual. For the question in the Top page category, the response includes relevant words, but the instruction sequence is not correct and cannot be found in the manual. It is most likely a hallucinated instruction. For the question in the Middle page category, the generated response is not correct. The content in the response is actual information retrieved from the manual, but it is not related to the question. The same pattern can be seen in the generated response to the question in the Scattered category. The information in the response seems to exist in the manual but is not related to the ground-truth answer. This implies that the model manages to retrieve actual information from the manual but fails to locate the correct regions. For the Multimodal question, the generated response seems to be general troubleshooting advice for handling issues with computers. The information does not seem to be retrieved from the manual, but rather from the data used to train LLaMA2-7B. For the Multimodal RAG framework, the responses generated for the questions of various complexity are mostly accurate. For the question in the Middle page category, the response is accurate and includes content from other pages of the manual as well. The response to the Scattered question manages to capture accurate information from the correct region. However, the response is too general and not as detailed as the ground truth, which makes it somewhat irrelevant to the question. For the Multimodal category, the framework manages to capture the correct answer in the very first sentence of the response.
However, the response is long and the following sentences are not as relevant and would not need to be included for the response to be accurate. What is noticed for the Multimodal RAG framework in this case is that all responses across the different questions contain similar information about warnings and what to be cautious about. It is hard to tell whether there is a flaw in the framework that makes it repeatedly retrieve the same information. On the other hand, when using a RAG framework for question answering about products, it would probably be preferable to receive too much rather than too little information about what to be cautious about. In general, the responses generated by the Multimodal RAG framework are somewhat accurate but contain unnecessary information. The framework shows only slight tendencies toward hallucination, which is a good indicator. Generally, the Multimodal RAG framework performs well on scattered questions where parts of the ground truth are located in different regions.

6 Discussion

In this chapter, the Text-only and the Multimodal RAG frameworks are compared. First, the key evaluation metrics chosen for this discussion are explained, followed by an evaluation of the effects of chunk sizes, top-k values, and the impact of the modality of the frameworks. The most optimal configurations of the two frameworks are compared and discussed. Then, key conclusions drawn from the qualitative analysis are presented. The chapter ends with suggestions for future work and a discussion of risk and ethical considerations.

6.1 Developing domain-specific RAG frameworks

To answer the first research question posed in this project – Is it possible to integrate a pre-trained LLM or LMM with a retrieval component in a RAG framework to generate responses to domain-specific questions? – the conducted experiments prove that it is not only feasible but also effective.
Two RAG frameworks, Text-only and Multimodal RAG, are built and their different behavior is observed when generating answers to questions obtained from electronics-related manuals. The integration of a similarity search-based retriever with either LLaMA2-7B or LLaVA-7B shows that it is feasible to leverage pre-existing knowledge by retrieving specific information to guide the answer-generation phase of LLMs and LMMs in this specific domain. The effectiveness of the built frameworks is examined on a set of evaluation metrics, namely the BERTScore and LangSmith metrics.

6.2 Parameter and modality impact on RAG framework performance

This section aims to answer the second research question – How do the modality and the parameters of a RAG framework affect the performance of generated responses? First, the thought process behind the selection of the most relevant metrics to base the comparison on is explained. Then, the effects of modality are discussed in a comparison between the Text-only and Multimodal RAG frameworks. Next, the specific effects of the parameters, chunk size and top-k value, are discussed.

6.2.1 Key evaluation metrics for the domain-specific RAG framework

Given the objective of this project, which is to build a framework that most accurately answers users' queries about electronics manuals, the most important BERTScore metric to base our assessment on is B-recall. It matches each token in the reference sentence to the most similar token in the generated sentence, while the opposite is done for B-precision. That is to say, a high B-precision score means that the information included in the generated answer is similar to the ground-truth answer. However, it does not mean that all relevant information is included. A high B-recall score, on the other hand, means that all of the relevant information from the reference is covered in the generated answer, which is crucial for our goal.
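The token-matching intuition behind B-recall and B-precision can be sketched in a few lines. This is not the actual BERTScore implementation – which matches tokens by cosine similarity of contextual BERT embeddings – but a toy stand-in using character-bigram overlap, shown purely to illustrate why a long answer that covers the whole reference scores high recall but lower precision:

```python
# Toy illustration of the greedy token matching behind BERTScore-style
# recall and precision. Real BERTScore matches tokens via cosine similarity
# of contextual BERT embeddings; here a crude character-bigram overlap
# stands in, just to make the recall/precision asymmetry visible.

def similarity(a: str, b: str) -> float:
    """Crude token similarity: Jaccard overlap of character bigrams."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)} or {s}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def greedy_score(from_tokens, to_tokens):
    """Match each token in from_tokens to its most similar token in
    to_tokens and average the best similarities."""
    return sum(max(similarity(t, u) for u in to_tokens)
               for t in from_tokens) / len(from_tokens)

def bert_style_scores(candidate: str, reference: str):
    cand, ref = candidate.split(), reference.split()
    recall = greedy_score(ref, cand)      # reference -> candidate
    precision = greedy_score(cand, ref)   # candidate -> reference
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A verbose candidate that covers every reference token scores perfect
# recall but lower precision, mirroring the pattern in our experiments.
p, r, f1 = bert_style_scores(
    "press the power button then hold it for five seconds to reset",
    "press the power button",
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The example shows the asymmetry that matters for this project: the reference is fully covered (recall of 1.0), while the extra tokens in the candidate pull precision below 1.0.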
Missing critical information from the reference answer could lead to incomplete or inaccurate assembly instructions, which can be dangerous for workers. All necessary details should be covered in the generated answer so as not to cause any damage in the assembly processes. Because of this, the focus is mainly on the B-recall metric when searching for the most optimal framework. The LangSmith metrics are used as complementary metrics to BERTScore and provide additional context to help capture aspects of the generated answers that are more aligned with human judgment. It needs to be noted that these metrics are based on an LLM assessment, whose imperfections can affect the results. Also, these metrics return binary scores for each data point, which makes them suboptimal as some aspects of the data can be lost. However, among the LangSmith metrics, contextual accuracy is critical since it assesses whether the generated answers accurately capture the essence of the ground-truth answer. Relevance is also important since it checks whether the generated answers directly address the worker's query. Coherence may be the least important of the three, but it is still interesting to evaluate as it assesses whether the answer is clear and well-structured. The main focus while discussing the results is therefore placed on B-recall and contextual accuracy. However, they will not be given the greatest importance if a high score in these metrics comes with a significant drop in other metrics. Also, more emphasis is placed on the BERTScore metrics when finding the optimal configurations of the frameworks, as they are objectively and quantifiably calculated, while LangSmith involves some level of subjectivity.

6.2.2 Effects of modality

The first observation across the performances of the Text-only and the Multimodal RAG frameworks is that the Multimodal RAG framework shows greater sensitivity to changes in the chunk size.
While the performance of the Multimodal RAG framework is more unstable, the Text-only RAG framework only shows small changes in performance across all chunk sizes. Both frameworks perform best with moderate chunk sizes – 128 being the optimal one for the Multimodal RAG framework and 256 or 512 for the Text-only RAG framework. However, the Multimodal RAG framework seems to be more affected by larger contexts, which is a clear limitation compared to the more stable Text-only RAG framework. Secondly, the top-k evaluation shows that the Text-only RAG framework is more stable. Both frameworks perform best with medium top-k values, either 4 or 6. However, the results of the BERTScore evaluation show that the decrease in performance for the largest top-k value is more drastic for the Multimodal RAG framework. In addition, the LangSmith metrics show that the Text-only RAG framework significantly outperforms the Multimodal RAG framework in contextual accuracy.

Framework                                      B-precision  B-recall  B-F1   Coherence  Contextual Accuracy  Relevance
Text-only RAG (chunk size = 256, top-k = 4)    0.827        0.874     0.850  0.920      0.520                0.740
Multimodal RAG (chunk size = 128, top-k = 4)   0.850        0.895     0.872  0.750      0.490                0.690

Table 6.1: Comparison of the scores achieved by the best-performing configurations of the Text-only and Multimodal RAG frameworks.

Next, to get a general overview and to compare the individual scores achieved by the best-performing configurations of the Text-only and the Multimodal RAG frameworks, Table 6.1 is created. From the previously discussed optimal configurations, the ones with the highest number of top scores, with relatively high B-recall and contextual accuracy, are chosen and visualized in the table. These configurations are the Text-only RAG framework with a chunk size of 256 and top-k value of 4, and the Multimodal RAG framework with a chunk size of 128 and top-k value of 4.
It can be deduced from the table that when the most optimal chunk size and top-k value are found for the Multimodal RAG framework, it does outperform the Text-only RAG framework on the three BERTScore metrics. Looking at B-recall, the best-performing configuration of the Multimodal RAG framework scores 0.895, while the best-performing configuration of the Text-only framework scores 0.874. However, the LangSmith metrics show otherwise. All of the LangSmith metrics are lower for the Multimodal RAG framework compared to the Text-only RAG framework. The score of the Text-only RAG framework for contextual accuracy is 0.520 compared to 0.490 for the Multimodal RAG framework. Coherence should also be highlighted, as it is significantly higher for the best-performing Text-only RAG configuration, with a score of 0.920 versus 0.750 for the best-performing Multimodal RAG configuration. Observing all experiments and not only the best-performing configurations, the most notable difference can be seen in the contextual accuracy scores. These scores are significantly lower for the Multimodal RAG framework, in both the chunk size and the top-k experiments, compared to the Text-only RAG framework. However, it needs to be noted that the contextual accuracy metric from LangSmith is based on an LLM judgment, whose imperfections can affect the results. In conclusion, while the best configuration of the Multimodal RAG framework shows better performance in B-precision and B-recall according to BERTScore, the Text-only RAG framework shows superior performance in coherence and contextual accuracy according to LangSmith. The lower LangSmith scores suggest that the Multimodal RAG framework struggles more with producing coherent answers that have a logical flow and an organized structure. They can also imply that it does not manage to produce answers that align with the question content or with human judgment on how to answer it.
One reason may be that the framework struggles with integrating the multimodal information in a structured and contextually accurate way, leading to confusion. The Text-only RAG framework offers more reliable performance, with more robustness to changes in the chunk size and the top-k value. This implies that without further refinement and optimization of the Multimodal RAG framework, it should not be favored over the more reliable Text-only RAG framework for the purpose of this project.

6.2.3 Effects of parameters

As shown in the experiments, the two key parameters of the retrieval stage – chunk size and top-k value – do influence the performance of RAG frameworks. These two parameters both determine how much context is presented to the LLM or LMM, which affects the quality of the generated answers [15]–[17]. As demonstrated in this project, optimizing the chunk size and top-k value is crucial for generating accurate and coherent responses. What distinguishes their roles is that the chunk size controls the granularity of information in the context, while the top-k value controls the quantity of context. Even though they have different objectives, they influence the same stage of the framework, though in different places, and therefore also affect each other. To achieve the best results, it would be ideal to find the most optimal combination of the two: a high value of one parameter could be balanced out by a low value of the other. However, in this project the goal was to test the individual effect of each parameter, so only one parameter at a time was changed. When discussing the effects of the parameters, the individual impact of chunk sizes and top-k values on the Multimodal and Text-only RAG frameworks is first described separately. Then, collective observations are drawn on how these parameters influence RAG frameworks in general, supported by the literature.
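To make the two parameters concrete, the sketch below shows a minimal retrieve-then-prompt loop in which chunk size and top-k directly control how much context reaches the generator. It is a toy stand-in, not the thesis pipeline: a word-count chunker and shared-word scoring replace the token-based splitter and the embedding similarity search of the LangChain/Chroma implementation.

```python
# Minimal sketch of how chunk size and top-k shape the context handed to
# the generator. A word-count chunker and keyword-overlap scoring stand in
# for the real token splitter and embedding-based similarity search.

def chunk(text: str, chunk_size: int, overlap: int = 0):
    """Split text into chunks of `chunk_size` words with optional overlap."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def retrieve(query: str, chunks, top_k: int):
    """Return the top-k chunks ranked by shared-word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:top_k]

# A tiny invented "manual" used only for illustration.
manual = (
    "To reset the device hold the power button for ten seconds. "
    "The battery should be charged before first use. "
    "Warning do not expose the device to water or open flames."
)

# Smaller chunks give finer granularity; a larger top-k widens the context.
chunks = chunk(manual, chunk_size=10)
context = "\n".join(retrieve("how do I reset the device", chunks, top_k=2))
prompt = f"Answer using only this context:\n{context}\nQuestion: how do I reset the device"
print(prompt)
```

Raising `chunk_size` makes each retrieved piece broader but noisier; raising `top_k` adds more pieces. Either way, the context portion of the prompt – and hence the generator's input – grows, which is exactly the trade-off examined in the following subsections.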
6.2.3.1 Chunk sizes

For the Text-only RAG framework, the BERTScore metrics show a relatively stable performance across the different chunk sizes. These results imply that the framework has a potential advantage in robustness against changes. Because all the scores in the BERTScore evaluation are similar, it is difficult to single out an optimal chunk size. Nevertheless, the best-performing configuration of the Text-only RAG framework could have a chunk size of 128, 256, or 512. The scores for these chunk sizes do not vary significantly and offer different advantages with respect to the other metrics. There are even higher contextual accuracy and B-recall scores observed with chunk size 1024; however, relevance shows the opposite behavior. Too high a chunk size cannot be chosen as the most optimal in this case, since it indicates a potential drop in the B-precision and relevance scores. Chunk sizes 256 and 512 offer a good balance between the B-precision and B-recall scores, while chunk size 256 shows the highest coherence score and chunk size 512 the highest relevance score. The performance of the Multimodal RAG framework is negatively affected by larger chunk sizes, which is observed in the BERTScore evaluation for all three metrics. This is a clear limitation of this framework, as it cannot handle large contexts without a loss in performance. The Multimodal RAG framework performs best with smaller chunk sizes, with 128 being the most optimal one. The configuration of the Multimodal RAG framework with this chunk size gives the highest scores for B-recall, B-precision, B-F1, and contextual accuracy.
The LangSmith scores for this framework show a similar pattern to the Text-only RAG framework, with the contextual accuracy increasing as the chunk size grows larger, while the coherence and relevance scores decrease. However, the changes in the contextual accuracy scores across different chunk sizes are generally more significant for the Multimodal RAG framework than for the Text-only RAG framework. This further shows that the Multimodal RAG framework is more unstable and sensitive to changes in the chunk size parameter.

6.2.3.2 Top-k values

The most optimal top-k value for the Text-only RAG framework according to the BERTScore evaluation seems to be 4 across all three metrics. The scores are similar for all of the BERTScore metrics across all top-k values. These results imply that more values would have to be investigated to distinguish an actual pattern. However, looking at the B-recall scores, the top-k values 2 and 4 both reach the highest score of 0.874. This indicates that the Text-only RAG framework is stable and manages to perform well despite parameter changes. The LangSmith evaluation shows more unstable scores across the top-k values for the Text-only RAG framework. For the contextual accuracy metric, the highest score of 0.560 is reached for the top-k value of 6. This implies that the generated responses of the framework best capture the context of the ground-truth answers when the top-k value grows larger. This is slightly unexpected, as the larger the value of k, the harder it gets to locate and retrieve the correct information. The BERTScore evaluation for the Multimodal RAG framework indicates that larger values of k negatively affect the performance. The B-precision, B-recall, and B-F1 metrics all show a relatively stable performance for top-k values 2, 4, and 6 but drop drastically for the top-k value of 8.
These results indicate that a larger k is not beneficial for the Multimodal RAG framework and that it does not manage to capture the context across a higher number of regions. The LangSmith evaluation shows more varying results, which makes it difficult to distinguish any trend. The contextual accuracy scores, however, slightly confirm the trend observed in the BERTScore evaluation: the performance is somewhat stable for the top-k values 2, 4, and 6 but drops for the top-k value of 8.

6.2.3.3 General observations

As supported by the literature, smaller chunk sizes provide more specific and focused information [17], which increases the precision. Larger chunk sizes could be beneficial for broader questions since they include wider contexts, but they could also add more confusing or irrelevant information, which is difficult to filter out in the later stages of a RAG framework [17]. This is confirmed in this project, as the coherence and relevance scores are the lowest for the largest chunk size. Larger chunk sizes, particularly 512 and 1024, result in significant performance drops for the Multimodal RAG framework, as it becomes harder to connect longer contexts with visual cues. The Text-only RAG framework remains relatively stable across different chunk sizes. The top-k value determines the volume of retrieved information. Low top-k values may not provide enough relevant context, which could generate incomplete or incoherent answers, while high top-k values may overwhelm the model and make the relevant chunks harder to recognize, producing less accurate responses [15]. This is shown in the experiments, where the middle top-k values (4 or 6) provide the best performance, ensuring sufficient context without including noise. Including unnecessary context also increases the computational cost by increasing the number of processed input tokens, which makes a too large chunk size or top-k value inadvisable.
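The cost argument can be made concrete with back-of-the-envelope arithmetic: assuming each retrieved chunk contributes roughly chunk_size tokens, the context portion of the prompt grows as chunk_size × top_k, with the prompt template and query adding only a constant on top.

```python
# Rough illustration of why large chunk sizes and top-k values are costly:
# assuming each retrieved chunk contributes about chunk_size tokens, the
# retrieved context fed to the model grows as chunk_size * top_k
# (ignoring the fixed prompt template and the query itself).

def context_budget(chunk_size: int, top_k: int) -> int:
    return chunk_size * top_k

for chunk_size, top_k in [(128, 6), (256, 4), (512, 4), (1024, 8)]:
    print(f"chunk_size={chunk_size:>4}, top_k={top_k}: "
          f"~{context_budget(chunk_size, top_k)} context tokens")
```

Under this estimate, a small chunk size with a larger top-k (128 × 6 = 768 tokens) is still cheaper than a large chunk size with a moderate top-k (512 × 4 = 2048), which is consistent with the idea of compensating small chunks with a higher k.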
It can be deduced from both the chunk size and the top-k evaluations that, when evaluated with BERTScore, B-recall is consistently higher than B-precision in every experiment. This could imply that false positives slightly outnumber false negatives and that the RAG frameworks have a tendency to include more tokens to avoid missing relevant ones, at the price of bringing in more irrelevant tokens. This is an advantage for this application, since the goal is focused more on covering all necessary information (high recall) than on providing only relevant information (high precision). The scores for the LangSmith metrics coherence and relevance are significantly higher than for the contextual accuracy metric. Coherence being the highest-scored metric across all experiments suggests that even though the RAG frameworks produce organized and structured responses, these responses may not align closely with the ground-truth answers. For our application, it is crucial to provide correct responses, which makes logically structured responses a lower priority than avoiding misleading or incomplete information. In conclusion, moderate chunk sizes – 128 to 256 – and top-k values – 4 to 6 – generally return the best performance across both RAG frameworks. Optimizing these parameters is crucial because the optimal chunk size provides a good balance between the level of detail and the breadth of context, while the optimal top-k value provides enough context without adding noise. For a future extension of the project, it is believed that it would be most beneficial to compensate a small chunk size (such as 128) with a larger top-k value (such as 6), since processing larger chunks is more computationally expensive than processing larger top-k values.

6.3 Qualitative analysis overview

What can be seen in the qualitative results is that all responses generated by LLaVA-7B, the Text-only RAG framework, and the Multimodal RAG framework are long and nested compared to the ground-truth answers. The ground-truth answers, however, are manually documented from the manuals, which means that the length and included details have been chosen subjectively. Because of this, the correctness of the information rather than the semantic similarity is what should be compared between the generated responses and the ground-truth answers. The results show that the quality of the generated responses varies a lot across the three models. No significant trend can be distinguished in any of the performances. However, the results show that the Text-only RAG framework and the Multimodal RAG framework manage to retrieve the correct answer more often than LLaVA-7B. This observation is expected, as LLaVA-7B is used as a baseline model without any fine-tuning or retrieval component for the specific context. It also implies that the two RAG frameworks are built and implemented successfully and serve their purpose of enhancing domain-specific performance. Looking at the performances of the Text-only and the Multimodal RAG frameworks, the quality varies a lot across the questions of different complexities. What can be noticed for both RAG frameworks is that they tend to hallucinate more for the questions in the Scattered and Multimodal categories. Most likely, more manuals would have to be analyzed to distinguish any further pattern in how the performance varies across question complexity. By doing so, the Multimodal RAG framework would be expected to perform better on multimodal questions. Even if the Text-only RAG framework performs well on multimodal questions, keeping the image data intact would ensure that no context is lost in the process of summarizing the images into text. The expected performance for the Top page and Middle page questions would probably not differ much between the RAG frameworks.
A contributing factor to these performances would rather be the chunk size and the top-k value. A qualitative analysis testing how the RAG frameworks perform on the different question complexities with different values of chunk size and top-k would be a natural extension of this project. The performance on the questions in the Scattered category would also probably depend on the chunk size and the top-k value.

6.4 Future work

There are several areas to consider in a potential future extension of this project. Firstly, the number of manuals used should be increased, since this project was limited to analyzing only 10 manuals. Performing the evaluation on more manuals would probably give clearer, more distinguishable results where the scores of the different configurations would stand out more. Similarly, the range of parameters tested should be extended. This could make the findings more generalizable and help further optimize the Multimodal RAG framework, which has the potential to outperform the Text-only RAG framework. Due to the limited computational resources and the time frame of the project, investigating a wider range of parameters and incorporating more manuals was not possible, but should be considered in a future extension. Secondly, the qualitative analysis clearly shows that the answers generated by the two RAG frameworks are most of the time long and unnecessarily extensive. A natural continuation would be to investigate prompting techniques to optimize the format of the responses. This type of analysis would be qualitative and dependent on the user's requirements: in certain cases extensive responses may be preferred, while in other cases they would not be. Investigating different prompting techniques could also serve as a method to minimize hallucinations.
As prior research shows that the format of the prompt can affect how much the frameworks hallucinate, this would be a natural extension of the project. To improve the performance and efficiency of the parameter-tuning experiments, it would be valuable to integrate dynamic chunk size and top-k retrieval mechanisms. These techniques adjust the parameter value based on the complexity of the query: for more complex queries, more context would be retrieved, but the parameter value would not be static, keeping the computational costs lower [20]. Additionally, re-ranking techniques that ensure the best-matching chunks are retrieved could be explored to further improve the accuracy of the generated responses, although this would come with additional computational costs [19]. These two methods could further optimize the RAG frameworks built in this project. Also, this project focuses on tuning parameters that influence the retriever component of a RAG framework. However, since the main components of this framework are both the retriever and the generative component, it would be beneficial to explore how different LLMs or LMMs affect performance. In this project, LLaMA2-7B and LLaVA-7B are implemented, but it would be valuable to try models that have more parameters or are generally newer, as they may offer different advantages. Another natural extension of this project would be a more thorough and separate evaluation of the retriever. Even though the parameters of the retrieval component are being optimized, only the final generated answers are evaluated. Evaluating the retriever's performance independently, by printing and evaluating the retrieved chunks, would provide more insight into the retrieval phase's actual effectiveness. One approach would be to manually select the regions of the manuals where the ground truth is located, since the dataset used in this project does not provide them, and use them to calculate page-level or paragraph-level accuracy. Another solution would be an LLM-as-a-judge approach, asking an LLM to reason about the relevancy of the retrieved chunks to a user's query. Further on, different retrievers could be compared that employ techniques other than the vector-based similarity search used in this project. Overall, even if the generative component of the RAG framework manages to produce accurate responses based on the information that is retrieved, it is crucial to evaluate how accurate the retrieved information is. Another potential future research area could be to further investigate the efficiency of the two RAG frameworks. Evaluating the frameworks with a cost-efficiency approach could give valuable insights. Measuring the running times for different tasks for the two frameworks would allow taking them into account in the overall evaluation. If the frameworks show very similar performances but the running times differ drastically, prioritizing which framework is the most suitable for certain tasks would become easier. As the running times for the extraction, summarization, and evaluation are long, another future area of interest would be to implement parallel programming. Splitting the runs across several cores would most likely reduce the running time, enhancing the overall efficiency of the frameworks.

6.5 Risk analysis and ethical considerations

While industrial automation comes with many advantages, certain risks need to be considered. An LLM may have tendencies to hallucinate, which will make it generate false responses. Providing false information about the production may result in security risks and danger for the workers, as well as quality issues with the products. A careful evaluation of the final model and human supervision are therefore crucial before real-world usage and implementation.
Another ethical aspect to consider is how employees will be affected by this kind of automated tool. Regular workflows and the need for human labor may be affected, aspects that come with both advantages and disadvantages and should be weighed against each other. Other ethical aspects to consider are that company data should be treated carefully and according to agreement. The open-source data that is used follows the GDPR.

7 Conclusion

In this project, a Text-only and a Multimodal RAG framework are developed to enhance the performance of LLMs and LMMs by integrating knowledge from an external database created from electronics user manuals. By doing so, the goal of answering the two research questions posed in the beginning is met, namely:

• Is it possible to integrate a pre-trained LLM or LMM with a retrieval model in a RAG framework to generate responses to domain-specific questions?

• How do the modality and parameters of the RAG framework affect the performance of the generated responses?

For the first research question, it is shown that it is feasible and effective to connect a similarity search-based retriever with either LLaMA2-7B or LLaVA-7B. This integration is achieved with the help of the LangChain library and the Chroma vector store. The final architectures of the two frameworks differ. The Text-only RAG framework first employs LLaMA2-7B to generate summaries of the text and tables extracted from the raw PDFs and uses LLaVA-7B to summarize features of the extracted images. Then, a Multi-Vector Retriever, which retrieves image summaries and raw text and tables, is used to prompt LLaMA2-7B to generate the final response. The Multimodal RAG framework uses CLIP embeddings to create a unified vector space for text and image data and then connects it via a Multi-Vector Retriever to LLaVA-7B. Both frameworks generate answers at a satisfying level of B-recall, although their behaviors differ in various scenarios.
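The multi-vector idea described above – index compact summaries for similarity search, but hand the raw content they point to back to the generator – can be sketched as follows. This is a toy illustration, not the thesis implementation, which uses LangChain's Multi-Vector Retriever with a Chroma vector store; the word-overlap scoring here merely stands in for real embedding similarity, and the index entries are invented examples.

```python
# Toy sketch of a multi-vector retriever: compact summaries are indexed
# for similarity search, but the raw source content they point to is what
# gets returned to the generator. Word-overlap scoring stands in for
# embedding similarity; the entries below are invented examples.

def score(query: str, text: str) -> int:
    """Number of words shared between the query and the indexed summary."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Each entry pairs a summary (used for retrieval) with the raw content
# (handed to the model), mirroring the summary -> raw-document mapping.
index = [
    ("summary: steps to reset the device using the power button",
     "RAW TEXT: Hold the power button for 10 seconds until the LED blinks."),
    ("summary: image of the battery compartment and charging port",
     "RAW IMAGE SUMMARY: Photo showing the charging port on the left side."),
]

def multi_vector_retrieve(query: str, top_k: int = 1):
    ranked = sorted(index, key=lambda pair: score(query, pair[0]), reverse=True)
    return [raw for _summary, raw in ranked[:top_k]]

context = multi_vector_retrieve("how do I reset the device")
print(context)
```

The design choice this illustrates is why the summaries matter: short, dense summaries are easier to match against a query than long raw passages or images, while the generator still receives the full original content.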
Through the qualitative analysis, it is shown that both frameworks manage to retrieve the correct answer more often than the baseline model LLaVA-7B, which implies that they serve their purpose of enhancing domain-specific performance. The second research question is addressed by evaluating two key parameters of the retriever – chunk size and top-k value – as well as investigating how the modality of the frameworks affects the performance. The findings show that the performance of the Multimodal RAG framework is more sensitive to changes in both chunk size and top-k value, while the Text-only RAG framework is more stable. Generally, moderate chunk sizes – 128 or 256 – and top-k values – 4 or 6 – returned the best scores across both frameworks. When the optimal parameters for the Multimodal RAG framework are found, it slightly outperforms the Text-only RAG framework according to the BERTScore metrics. However, the Text-only RAG framework shows superior performance in coherence and contextual accuracy according to the LangSmith metrics when testing the different parameter configurations. This implies that the Text-only RAG framework is more reliable, because it generates more coherent and rational responses. Ultimately, it is the preferred framework for achieving the goal of this project, as stability and reliability are crucial aspects of assembly processes. Even though the Multimodal RAG framework shows its potential with the right optimization, it would require additional tuning to improve its stability and to possibly outperform the Text-only RAG framework. The final product of the project is the optimization of the two RAG frameworks to a significant extent. The aforementioned findings of this project can later be used as a foundation for further research and development in the field of automated assembly verification and RAG-based VQA.
Future work on the VQA system proposed in this project should aim to refine the Multimodal RAG framework in order to boost its stability and performance. Furthermore, the ethical aspects need further elaboration, especially since the trustworthiness of the responses is crucial in this domain, so as not to lead to misinformation and resulting errors.
A Appendix 1

Figure A.1: Generated responses by LLaVA-7B, the Text-only RAG and the Multimodal RAG framework for questions about the Dell manual.

Figure A.2: Generated responses by LLaVA-7B, the Text-only RAG and the Multimodal RAG framework for questions about the Samsung manual.

Figure A.3: Generated responses by LLaVA-7B, the Text-only RAG and the Multimodal RAG framework for questions about the Sony manual.

Figure A.4: An example page from one of the manuals chosen for evaluation.