A list of productive vocabulary generated from second language learners' essays
Corpora for second language (L2) learning may contain receptive vocabulary, i.e., vocabulary that learners can understand, or productive vocabulary, i.e., vocabulary that L2 learners themselves are able to use actively. Corpora containing productive vocabulary could assist both students and teachers, e.g. in tracking actual learning progress, as well as language technologists who wish to analyse L2 learners' language. While productive vocabulary lists exist for other languages, such as the English Vocabulary Profile list, none has been made for Swedish. In this paper, we describe our project to create a Swedish vocabulary list generated from a learner corpus, which consists of a number of L2 learner essays collected in electronic form. The list, named SweLL-list, contains normalised lemma and part-of-speech tag combinations and their frequency counts. We present the work that was done to create a part of this learner corpus and the list based on it. Furthermore, we detail a normalisation algorithm, based on Levenshtein distance, used to correct L2 word-level errors. We then describe our list in detail and analyse this resource through a comparison with SVALex, a vocabulary list based on L2 reading comprehension materials. Finally, we examine the results of the normalisation algorithm. Examining the SweLL-list and comparing it with SVALex gave us some indications of the L2 students' progress. For example, we saw that while a large part of the vocabulary is taught at the intermediate levels, the students' productive vocabulary does not increase accordingly until the proficient levels. Our analysis showed that Levenshtein distance is promising for correcting L2 word-level errors, especially for longer words (more than 4 characters) and where only one spelling error had been made.
In order to improve the normalisation for multiple errors and shorter words, more work is needed, possibly combining the Levenshtein distance with other language technology tools.
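
To make the idea of Levenshtein-based normalisation concrete, the following is a minimal sketch in Python and not the implementation used for the SweLL-list. It assumes that a misspelled token is replaced by the closest entry in a reference word list, provided the edit distance does not exceed a threshold. The names `levenshtein`, `normalise`, `lexicon` and `max_distance`, the tie-breaking rule, and the toy word list are all hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalise(token: str, lexicon: set[str], max_distance: int = 1) -> str:
    """Map token to its closest lexicon entry, or return it unchanged.

    Assumption: ties are broken alphabetically, and a token is left as it is
    when no entry lies within max_distance edits.
    """
    if token in lexicon or not lexicon:
        return token
    best = min(lexicon, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_distance else token


# Toy word list, for illustration only.
LEXICON = {"skola", "student", "språk"}

if __name__ == "__main__":
    for word in ["skolla", "studnt", "xyz"]:
        print(word, "->", normalise(word, LEXICON))
```

With a threshold of one edit, a sketch like this can only repair single-error tokens, and for short words many unrelated lexicon entries lie within one edit, which is consistent with the difficulties noted above for multiple errors and shorter words.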