Computational Models of Language and Vision: Studies of Neural Models as Learners of Multi-Modal Knowledge
Abstract
This thesis develops and evaluates computational models that generate natural language descriptions of visual content. We build and examine models of language and vision to gain a deeper understanding of how they reflect the relationship between the two modalities. This understanding is crucial for performing computational tasks that involve both modalities.

The first part of the thesis introduces three studies that inspect the role of self-attention in three different self-attention blocks of the object relation transformer model. We examine attention heatmaps to understand how the model connects words, objects, and relations within the tasks of image captioning and image paragraph generation. We connect our interpretation of what the model learns in its self-attention weights with insights from theories of human cognition, visual perception, and spatial language.

The three studies in the second part of the thesis investigate how representations of images and texts can be applied and learned in task-specific models for image paragraph generation, embodied question answering, and variation in human object naming.

The last two studies in the third part examine properties of human-generated texts that multi-modal models are expected to acquire in image paragraph generation as well as in perceptual category description and interpretation tasks. We analyse discourse structure in image paragraphs produced with different decoding methods. We also inspect whether models of perceptual categories can abstract from visual representations and use this knowledge to generate descriptions that exhibit the levels of discriminativity important for the task. We show how automatic measures for evaluating text generation behave in a comparison of model-generated and human-generated image descriptions.

This thesis presents several contributions. We illustrate that, under specific modelling conditions, self-attention can capture information about the relationship between objects and words. Our results emphasise that the specifics of a task determine the manner and context in which different modalities are processed, as well as the degree to which each modality contributes to the task. We demonstrate that, while favoured by automatic evaluation metrics in different tasks, machine-generated image descriptions lack the discourse complexity and discriminative power that are often important for generating better, human-like image descriptions.
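As an illustration of the kind of self-attention inspection the abstract refers to, the following is a minimal sketch of rendering attention weights as a heatmap. It is not the thesis code: the tokens, dimensions, and randomly generated queries and keys are hypothetical stand-ins, and the example only demonstrates the visualisation step (scaled dot-product attention weights plotted as a token-by-token matrix).

```python
# Minimal sketch: plot a self-attention weight matrix as a heatmap.
# Hypothetical example; in a real analysis the weights would come from
# a trained model's self-attention layers, not random projections.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical input sequence mixing word tokens and object-region tokens.
tokens = ["a", "dog", "on", "grass", "<obj:dog>", "<obj:grass>"]
d_model = 16

# Stand-in queries and keys (placeholders for a trained layer's projections).
Q = rng.normal(size=(len(tokens), d_model))
K = rng.normal(size=(len(tokens), d_model))

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)).
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Heatmap: each row shows how one token distributes attention over all tokens.
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(weights, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right")
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("attended-to token")
ax.set_ylabel("attending token")
fig.colorbar(im, ax=ax, fraction=0.046)
fig.tight_layout()
plt.savefig("attention_heatmap.png")
```

In an actual study, such heatmaps would be extracted per head and per layer from the model under investigation and then interpreted with respect to the words, objects, and relations in the input.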
Parts of work
How Vision Affects Language: Comparing Masked Self-Attention in Uni-Modal and Multi-Modal Transformer. Nikolai Ilinykh and Simon Dobnik. 2021. In Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR), pages 45–55, Groningen, Netherlands (Online). Association for Computational Linguistics.
https://aclanthology.org/2021.mmsr-1.5/

What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations. Nikolai Ilinykh and Simon Dobnik. 2021. Frontiers in Artificial Intelligence: Identifying, Analyzing, and Overcoming Challenges in Vision and Language Research, 4, 767971.
http://dx.doi.org/10.3389/frai.2021.767971

Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer. Nikolai Ilinykh and Simon Dobnik. 2022. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4062–4073, Dublin, Ireland. Association for Computational Linguistics.
https://aclanthology.org/2022.findings-acl.320/

When an Image Tells a Story: The Role of Visual and Semantic Information for Generating Paragraph Descriptions. Nikolai Ilinykh and Simon Dobnik. 2020. In Proceedings of the 13th International Conference on Natural Language Generation, pages 338–348, Dublin, Ireland. Association for Computational Linguistics.
https://aclanthology.org/2020.inlg-1.40/

Look and Answer the Question: On the Role of Vision in Embodied Question Answering. Nikolai Ilinykh, Yasmeen Emampoor, and Simon Dobnik. 2022. In Proceedings of the 15th International Conference on Natural Language Generation, pages 236–245, Waterville, Maine, USA and virtual meeting. Association for Computational Linguistics.
https://aclanthology.org/2022.inlg-main.19/

Context matters: evaluation of target and context features on variation of object naming. Nikolai Ilinykh and Simon Dobnik. 2023. In Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing, pages 12–24, Ingolstadt, Germany. Association for Computational Linguistics.
https://aclanthology.org/2023.limo-1.3/

Do Decoding Algorithms Capture Discourse Structure in Multi-Modal Tasks? A Case Study of Image Paragraph Generation. Nikolai Ilinykh and Simon Dobnik. 2022. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 480–493, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
https://aclanthology.org/2022.gem-1.45/

Describe Me an Auklet: Generating Grounded Perceptual Category Descriptions. Bill Noble* and Nikolai Ilinykh*. 2023. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9330–9347, Singapore. Association for Computational Linguistics. *Equal contribution.
https://aclanthology.org/2023.emnlp-main.580/
Degree
Doctor of Philosophy
University
Göteborgs universitet. Humanistiska fakulteten
University of Gothenburg. Faculty of Humanities
Institution
Department of Philosophy, Linguistics and Theory of Science ; Institutionen för filosofi, lingvistik och vetenskapsteori
Disputation
11 June 2024, at 13:00, room J222, Humanisten, Renströmsgatan 6
Date of defence
2024-06-11
E-mail
nikolai.ilinykh@gu.se
Other description
This thesis develops and evaluates computational models that generate natural language descriptions of visual content. We build and examine models of language and vision to gain a deeper understanding of how they reflect the relationship between the two modalities. This understanding is crucial when performing different tasks. The first part of the thesis introduces three studies that examine the role of self-attention in three different self-attention blocks of the object relation transformer model. We examine attention heatmaps to understand how the model connects different words, objects, and relations in image captioning and image paragraph generation. We connect our interpretation of what the model learns in the self-attention weights with insights from theories of human cognition, visual perception, and spatial language.

The three studies in the second part of the thesis investigate how multi-modal representations of images and texts can be applied and learned in task-specific models for image paragraph generation, embodied question answering, and variation in human object naming.

The last two studies in the third part of the thesis examine properties of human-produced texts that multi-modal models are expected to acquire in image paragraph generation as well as in the description and interpretation of perceptual categories. We analyse discourse structure in image paragraphs produced with different decoding methods. We also examine whether models of perceptual categories can abstract from visual representations and use this knowledge to generate descriptions that discriminate at levels important for the task. We show how automatic measures for evaluating text generation behave in a comparison of model-generated and human-generated image descriptions.

The thesis presents several contributions. We show that, under specific modelling conditions, self-attention can reflect information about the relationship between objects and words. Our results indicate that the specific design of a task determines in what way and in which context different modalities are processed, as well as to what extent each modality contributes to the task. We demonstrate that, while machine-generated image descriptions are favoured by automatic evaluation metrics in different tasks, they lack the discourse complexity and discriminative power that are often important for generating better and more human-like image descriptions.
Date
2024-05-16
Author
Ilinykh, Nikolai
Keywords
language and vision
self-attention
multi-modal representation learning
evaluation of language models
computational linguistics
machine learning
Publication type
Doctoral thesis
ISBN
978-91-8069-767-5 (PRINT)
978-91-8069-768-2 (PDF)
Language
eng