From corpus to language classroom: reusing Stockholm Umeå Corpus in a vocabulary exercise generator SCORVEX
This master's thesis evaluates the Stockholm Umeå Corpus (SUC) as a source of teaching materials for learners of Swedish as a second language. The evaluation has been carried out both theoretically and practically. On the theoretical side, readability tests have been run on all SUC texts to determine whether appropriate texts can be selected automatically for each proficiency level. To make the readability analysis more “vocabulary-aware”, a lexical frequency profile of each text has been collected, analyzed and incorporated into the final readability score assigned to that text. SUC has proven to be a rich source of texts at different proficiency levels that are appropriate for language training purposes. Advantages and disadvantages of SUC as a source of pedagogical materials have been identified in the course of the work. On the practical side, as a by-product of the theoretical analysis, a pedagogical tool, SCORVEX (Swedish CORpus-based Vocabulary EXercise generator), has been designed and implemented. The existing modules of SCORVEX demonstrate the extent to which pedagogically acceptable vocabulary items can be generated with SUC as the only language resource. The thesis demonstrates how wordbank items, multiple-choice items and c-tests can be generated automatically for a specified proficiency level, word frequency band and word class. In yes/no items, potential words are generated on the basis of existing morphemes. All four modules are therefore “language-aware”. Access to frequency data obtained from SUC is the prerequisite for exercise generation, whereas the SUC text archive is the only source of texts, sentences and words for the vocabulary items. This thesis will hopefully spark interest among teachers in testing the generator in real-life conditions, and may even convince some teachers of the usefulness of this pedagogical tool. Numerous directions for further development of the software are outlined in the thesis.
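To make the c-test item type mentioned above concrete, the following is a minimal sketch of the usual c-test principle, in which the second half of every second word is replaced by blanks. It is not the SCORVEX implementation: the function name `make_c_test`, the underscore notation for gaps, and the choice to keep the first half rounded down are assumptions made for illustration, and the sketch ignores proficiency level, frequency band and word class.

```python
import re


def make_c_test(sentence: str, min_len: int = 2) -> tuple[str, list[str]]:
    """Hypothetical helper: blank the second half of every second word.

    Returns the gapped sentence and the list of full target words.
    """
    gapped, answers = [], []
    word_index = 0
    for token in sentence.split():
        # Separate the word from trailing punctuation; tokens that do not
        # fit this simple pattern (e.g. hyphenated words) are left intact.
        match = re.fullmatch(r"(\w+)(\W*)", token)
        if match is None:
            gapped.append(token)
            continue
        word, trailing = match.groups()
        word_index += 1
        if word_index % 2 == 0 and len(word) >= min_len:
            keep = len(word) // 2  # assumption: keep the first half, rounded down
            gapped.append(word[:keep] + "_" * (len(word) - keep) + trailing)
            answers.append(word)
        else:
            gapped.append(token)
    return " ".join(gapped), answers


text, answers = make_c_test("Jag läser en bok om Sverige.")
print(text)
print(answers)
```

In a generator of the kind described here, the input sentence would come from a SUC text selected for the target proficiency level, and the answer list would serve as the key.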