IŻ SWÓJ JĘZYK MAJĄ! An exploration of the computational methods for identifying language variation in Polish

Szawerna, Maria Irena
University of Gothenburg / Department of Philosophy, Linguistics and Theory of Science
Göteborgs universitet / Institutionen för filosofi, lingvistik och vetenskapsteori
2023-06-19T08:05:50Z
2023-06-19
Computational approaches to language variation continue to contribute to various fields, including Natural Language Processing (NLP) and linguistics. Accommodating variation within natural language increases the robustness of NLP models and their usefulness in real-life applications; at the same time, detecting and describing variation and the trends that govern it is one of the main goals of sociolinguistics and historical linguistics, so some advances in NLP can contribute to these fields as well. As quantitative and corpus research appears to be one of the current trends in historical linguistics, the need for annotated historical data is becoming increasingly apparent. In this thesis, a selection of tools and methods is tested for the ability to detect variation between a manually annotated sample of non-standard historical Polish and corpora of modern Polish, as well as tools based on those corpora. The experiments include part-of-speech tagging with two tagsets, lemmatization, vocabulary comparisons, and an n-gram analysis. The results reveal what kinds of variation each approach can discover in the text and to what extent. Since the majority of the presented methods require annotated data, they would be time- and resource-consuming if applied to larger corpora; nevertheless, they reveal certain trends in variation, as well as what kind of preprocessing may be needed for this sort of data to be annotated automatically with success. Such preprocessing could enable the creation of a larger corpus and facilitate further research. Additionally, a comparison of the tagging and lemmatizing performance of various tools on modern Polish is presented, and the annotated historical text itself constitutes a relevant contribution as well.
https://hdl.handle.net/2077/77238
eng
Humanities, Theology
language variation, Polish, diachronic linguistics, part-of-speech tagging, lemmatization, corpus linguistics
Text
Student essay
H2

Files

Original bundle

Name:
Szawerna_Maria.pdf
Size:
1.05 MB
Format:
Adobe Portable Document Format
Description:
Master thesis

License bundle

Name:
license.txt
Size:
4.68 KB
Format:
Item-specific license agreed upon to submission