Drawing with a social robot: Evaluating a vision-language model on spatial prepositions in an L2 drawing-based learning task
Abstract
This study develops and evaluates a multimodal pipeline for studying interactive drawing in a robot-assisted language learning scenario with human participants. A social robot is paired with a vision-language model (VLM) that interprets freehand drawings. The system processes each sketch while the participant is drawing it, capturing an image every five seconds. At each capture it detects objects and spatial relations and integrates this information into the robot’s verbal feedback, which evaluates the drawing’s correctness. A manual Wizard of Oz (WoZ) evaluation guides this feedback in real time and guards the human-robot interaction against model errors. The correctness of the drawings is evaluated at two levels: the entire sketch and its individual components (objects and relations). The study also examines how the agent’s outward appearance and the type of feedback influence participants’ perceptions of the interaction during the drawing tasks. Findings indicate that the VLM captures objects and the spatial relations between them with moderate accuracy, but its behaviour remains unstable across drawing stages. In practice, the model could function well as an initial filter, but guiding participants still requires human judgement. The agent’s appearance seems to matter more than the type of feedback, but the current sample does not allow us to confirm a consistent advantage for either factor. With minor adaptations, researchers can reuse the pipeline to explore other language pairs (beyond the current Greek-English setup), additional spatial prepositions and alternative object sets. Future work will focus on expanding the study and refining the model.
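The two-level correctness check described in the abstract (the entire sketch versus its individual objects and relations) can be illustrated with a minimal Python sketch. This is not the study's implementation: the names `Relation` and `evaluate_capture`, the cat-and-table task, and the assumption that the VLM output has already been parsed into object labels and (figure, preposition, ground) triples are all hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical spatial-relation triple, e.g. cat / under / table."""
    figure: str       # object being located
    preposition: str  # spatial preposition
    ground: str       # reference object


def evaluate_capture(target_objects, target_relations,
                     detected_objects, detected_relations):
    """Score one capture at component level and at whole-sketch level.

    All four arguments are assumed to be already-parsed collections;
    no real VLM API is called here.
    """
    target_objects = set(target_objects)
    target_relations = set(target_relations)
    found_objects = target_objects & set(detected_objects)
    found_relations = target_relations & set(detected_relations)
    return {
        # Component level: which required items are present or missing.
        "objects_missing": sorted(target_objects - found_objects),
        "relations_missing": len(target_relations - found_relations),
        # Sketch level: correct only if every required item was found.
        "sketch_correct": (found_objects == target_objects
                           and found_relations == target_relations),
    }


# Hypothetical task: "draw a cat under a table".
targets = {"cat", "table"}
relations = {Relation("cat", "under", "table")}

# Two simulated captures, e.g. taken five seconds apart.
early = evaluate_capture(targets, relations, {"table"}, set())
late = evaluate_capture(targets, relations,
                        {"cat", "table", "lamp"},
                        {Relation("cat", "under", "table")})
print(early)
print(late)
```

In this sketch, extra detections (such as the lamp) do not count against the drawing; only missing required objects or relations do. Whether the actual study treats extra items this way is not stated in the abstract, so this is a design assumption of the example.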