Fast visual grounding in interaction
A major challenge in the development of situated agents is that they must be capable of grounding real objects in their environment to representations with semantic meaning, so that these can be communicated to human agents using natural language. de Graaf (2016) developed the KILLE framework, a static camera-based robot capable of learning objects and spatial relations from very few samples, using image processing algorithms suited to learning from limited data. However, this framework has a major shortcoming: the time needed to recognise an object grows substantially as the system learns more objects, which motivates us to design a more efficient object recognition module. This project investigates a way to improve object recognition in the same robot framework using a neural network approach suitable for learning from very few image samples: Matching Networks (Vinyals et al., 2016). Our work also investigates how transfer learning from large datasets can improve object recognition performance and speed up learning, both important properties for a robot that interacts online with humans. We therefore evaluate the performance of our situated agent with transfer learning from pre-trained models and with different conversational strategies with a human tutor. Results show that the robot system trains models quickly and achieves high object recognition accuracy in small domains.