IŻ SWÓJ JĘZYK MAJĄ! An exploration of the computational methods for identifying language variation in Polish
Abstract
Computational approaches to language variation continue to make relevant contributions to various
fields, including Natural Language Processing (NLP) and linguistics. Being able to accommodate
variation within natural language increases the robustness of NLP models and their usefulness in
real-life applications. At the same time, detecting and describing variation and the trends that govern it
is one of the main goals of sociolinguistics and historical linguistics, meaning that some of the
advances in NLP can contribute to these fields as well. As quantitative and corpus-based research
appears to be one of the current trends in historical linguistics, the need for annotated historical data is
becoming increasingly apparent.
Within this thesis, selected tools and methods are tested for their ability to detect variation
between a manually annotated sample of non-standard historical Polish, on the one hand, and corpora of modern
Polish and the tools based on them, on the other. The experiments include part-of-speech tagging with two tagsets,
lemmatization, vocabulary comparisons, and an n-gram analysis. The results reveal what kinds
of variation each approach can discover in the text and to what extent. Since the majority of the
presented methods require the data to be annotated, they would be time- and resource-consuming
if applied to larger corpora. Nevertheless, they do reveal certain trends in variation, as well as
what kind of preprocessing may be needed for this sort of data to be annotated automatically
with success. Successful automatic annotation could in turn enable the creation of a larger corpus, facilitating further
research. Additionally, a comparison of the tagging and lemmatization performance of various tools on
modern Polish is presented, and the annotated historical text itself constitutes a relevant contribution
as well.
Degree
Student essay
Date
2023-06-19
Author
Szawerna, Maria Irena
Keywords
language variation, Polish, diachronic linguistics, part-of-speech tagging, lemmatization, corpus linguistics
Language
eng