Drawing with a social robot: Evaluating a vision-language model on spatial prepositions in an L2 drawing-based learning task
Abstract
This study develops and evaluates a multimodal pipeline for studying interactive drawing
in a robot-assisted language learning scenario with human participants. A social
robot is paired with a vision-language model (VLM) that interprets freehand drawings.
The system processes the sketches as they are drawn by the participant, capturing an
image every five seconds. At each capture it detects objects and spatial relations and
integrates this information into the robot’s verbal feedback, which evaluates the drawing’s
correctness. A Wizard of Oz (WoZ) manual evaluation guides this feedback in
real time and guards the human-robot interaction against model errors. Correctness of
the drawings is evaluated at two levels: the entire sketch and its individual components
(objects and relations). The study also examines how the agent’s outward appearance
and the type of feedback influence participants’ perceptions of the interaction during the
drawing tasks. Findings indicate that the VLM captures objects and the spatial relations between
them with moderate accuracy, while its behaviour remains unstable across drawing
stages. In practice, the model could function well as an initial filter, but guiding
participants still requires human judgement. The agent’s appearance seems to matter
more than the feedback type, but within the existing sample we cannot confirm a consistent
advantage for either factor. With minor adaptations, the pipeline can be reused by
researchers to explore other language pairs (beyond the current Greek-English setup),
additional spatial prepositions and alternative object sets. Future work will focus on
expanding the study and refining the model.
Degree
Student essay
Date
2025-11-06
Author
Daniilidou, Viktoria Paraskevi
Keywords
drawing, vision and language models, social robot, spatial prepositions, language learning
Language
eng