Using Language Models to evaluate annotation bias in the Manifesto Project Corpus
Abstract
Good data is crucial for good research. Many organisations collect and provide data for research in the hope of furthering our understanding of society. One such organisation is the Manifesto Project, which curates a dataset of election manifestos as both raw text and coded data detailing the topics mentioned in each manifesto. While widely used, the dataset is also heavily criticized. In this thesis we analyze whether coder bias is noticeably present in this corpus. We focus on the bias that can arise because coders know which party a manifesto belongs to, which could allow preconceptions to influence the coding process. We analyze the data using a cosine similarity analysis of sentence embeddings and fine-tuned RoBERTa models. The cosine similarity analysis shows widespread, general inconsistencies in the coding process: similar sentences are often coded differently. The RoBERTa models show that green parties are treated differently on the topic of environmental protection. For instance, their sentences about agriculture are disproportionately often assigned the code “environmental protection”, while segments discussing sustainability are disproportionately often assigned the code “anti-growth economy” when compared to non-green parties. The actual content of these segments is similar, but the coding seems to be influenced by party affiliation. Overall, our results indicate the presence of coder bias in the Manifesto Project data, which researchers and practitioners should be aware of when using this dataset in their work.
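As a rough illustration of the kind of consistency check the cosine similarity analysis describes, the following minimal sketch flags pairs of sentences whose embeddings are highly similar but whose assigned codes differ. It is not the thesis's implementation: the function names, the threshold of 0.9, the toy two-dimensional vectors, and the label strings are all assumptions made for illustration, and real sentence embeddings would come from an embedding model rather than being written by hand.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def inconsistent_pairs(embeddings, codes, threshold=0.9):
    """Return (i, j, similarity) for pairs that are similar but coded differently.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    flagged = []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        if sim >= threshold and codes[i] != codes[j]:
            flagged.append((i, j, sim))
    return flagged


# Toy, hand-written vectors standing in for sentence embeddings (assumption).
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
# Illustrative label strings based on the topic names in the abstract.
codes = ["environmental protection", "anti-growth economy", "environmental protection"]

pairs = inconsistent_pairs(embeddings, codes)
```

Here only the first two toy sentences are flagged, since they point in nearly the same direction but carry different codes. The pairwise loop is quadratic in the number of sentences, so a corpus-scale analysis would likely need batching or approximate nearest-neighbour search.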