Using Language Models to evaluate annotation bias in the Manifesto Project Corpus
Abstract
Good data is crucial for good research. Many organizations collect and provide
data for research in the hope of furthering our understanding of society. One such
organization is the Manifesto Project, which curates a dataset of election
manifestos as both raw text and coded data detailing the topics each manifesto
addresses. While widely used, the dataset is also heavily criticized.
In this thesis, we analyze whether coder bias is noticeably present in this
corpus. We focus on the bias that can arise because coders know which party a
manifesto belongs to, which could allow preconceptions to influence the coding
process. We analyze the data using two approaches: a cosine similarity analysis
of sentence embeddings and fine-tuned RoBERTa models.
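As a rough illustration of the first approach, a pairwise comparison of sentence embeddings might look like the sketch below. The sentence-transformers library, the model name, and the example sentences are assumptions chosen for illustration, not the thesis's actual setup.

```python
# Minimal sketch: embed two segments and compare them with cosine
# similarity, assuming the sentence-transformers library. The model
# name and example sentences are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two hypothetical manifesto segments that might carry different codes.
sentences = [
    "We will invest in sustainable farming practices.",
    "Our party supports sustainable agriculture.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# A similarity close to 1.0 flags near-identical content; if such
# pairs carry different codes, that hints at coding inconsistency.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```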
The cosine similarity analysis shows widespread inconsistencies in the coding
process: similar sentences are often coded differently. The RoBERTa models show
that green parties are treated differently regarding the topic of environmental
protection. For instance, compared to non-green parties, their sentences about
agriculture are disproportionately assigned the code "environmental protection",
while their segments discussing sustainability are disproportionately assigned
the code "anti-growth economy". The actual content of these segments is similar
across parties, but the coding appears to be influenced by party affiliation.
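The second approach, fine-tuning RoBERTa to predict a segment's code, could be sketched roughly as follows, assuming the Hugging Face transformers and datasets libraries. The toy data, label count, and hyperparameters are placeholders (the actual coding scheme has many more categories), not the thesis's training configuration.

```python
# Minimal sketch: fine-tune RoBERTa as a manifesto-code classifier.
# Texts, labels, label count, and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3  # placeholder; the real scheme has far more codes
)

# Toy stand-in for coded manifesto segments: text plus an integer code.
data = Dataset.from_dict({
    "text": [
        "We will protect our forests and rivers.",
        "Taxes on small businesses must be reduced.",
        "Investment in public schools is our priority.",
    ],
    "label": [0, 1, 2],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-manifesto",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=data,
)
trainer.train()
```

Comparing such a classifier's predictions with the human-assigned codes, split by party family, is one way to surface party-dependent coding patterns.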
Overall, our results indicate the presence of coder bias in the Manifesto Project
data, which researchers and practitioners should be aware of when using this
dataset in their work.
Degree
Student essay
Date
2024-10-16
Author
Carlsson, Leo
Hanlon, Konstantin
Keywords
Coding bias
Transformers
Natural Language Processing