Lilja Øvrelid
Argument Differentiation
Soft constraints and data-driven models

Data linguistica
Editor: Lars Borin
Språkbanken • Språkdata
Department of Swedish Language
University of Gothenburg
20 • 2008

Gothenburg 2008
ISBN 978-91-87850-35-6
ISSN 0347-948X
Printed in Sweden by Intellecta Docusys Västra Frölunda 2008
Typeset in LaTeX 2ε by the author
Cover design by Kjell Edgren, Informat.se
Front cover illustration: How to describe the world is still an open question by Randi Nygård ©
Author photo on back cover by Rudolf Rydstedt

ABSTRACT

The ability to distinguish between different types of arguments is central to syntactic analysis, whether studied from a theoretical or computational point of view. This thesis investigates the influence and interaction of linguistic properties of syntactic arguments in argument differentiation. Cross-linguistic generalizations regarding these properties often express probabilistic, or soft, constraints, rather than absolute requirements on syntactic structure. In language data, we observe frequency effects in the realization of syntactic arguments. We propose that argument differentiation can be studied using data-driven methods which directly express the relationship between frequency distributions in language data and linguistic categories. The main focus in this thesis is on the formulation and empirical evaluation of linguistically motivated features for data-driven modeling. Based on differential properties of syntactic arguments in Scandinavian language data, we investigate the linguistic factors involved in argument differentiation from two different perspectives.

We study automatic acquisition of the lexical semantic category of animacy and show that statistical tendencies in argument differentiation support automatic classification of unseen nouns.
The classification is furthermore robust, generalizable across machine learning algorithms, as well as scalable to larger data sets. We go on to perform a detailed study of the influence of a range of different linguistic properties, such as animacy, definiteness and finiteness, on argument disambiguation in data-driven dependency parsing of Swedish. By including features capturing these properties in the representations used by the parser, we are able to improve accuracy significantly, and in particular for the analysis of syntactic arguments.

The thesis shows how the study of soft constraints and gradience in language can be carried out using data-driven models and argues that these provide a controlled setting where different factors may be evaluated and their influence quantified. By focusing on empirical evaluation, we come to a better understanding of the results and implications of the data-driven models and furthermore show how linguistic motivation in turn can lead to improved computational models.

ACKNOWLEDGEMENTS

This thesis has been a big part of my life for several years and to think that it is actually finished now is truly beyond my grasp. I do know, however, that there are numerous people who have helped and supported me and whom it is my undivided pleasure to thank.

I want to express my gratitude to my two supervisors, Elisabet Engdahl and Joakim Nivre. Elisabet welcomed me to Gothenburg over four years ago and has since then been a person to be counted with in my life. She has provided advice and pointed criticism on all aspects of my work, made me think and rethink linguistic issues small and large and pushed me to move on when I was frozen. Thank you for your enthusiasm and interest, for truly caring, for always making time, for reading into the last hours, and for being such an open-minded, outstanding linguist.
Joakim has been involved almost from the very beginning and has provided invaluable insight and inspiration in the writing of this thesis. Thank you so much for taking time out of your busy schedule, for always showing a genuine interest in my work, for your clarity of thought, formal expertise and for new ideas. Thank you both for believing in me when I did not!

There are several other people who have read and commented on parts of this thesis along the way and whom I would like to give my warmest thanks to: Maia Andréasson, Harald Hammarström, Fredrik Heinat, Helen de Hoop, Jerker Järborg, Ida Larsson, Benjamin Lyngfelt, Malin Petzell and Annie Zaenen. A special thanks to Beáta Megyesi for scrutinizing a first draft of this thesis for my final seminar and providing very useful comments.

I want to thank Helen de Hoop, Monique Lamers, Peter de Swart, Sander Lestrade and everyone in the PIONIER project at Radboud University, Nijmegen for welcoming me as a guest researcher and for sharing thoughts on the ever-fascinating topic of animacy. I would also like to thank Gemma Boleda for discussions about classification, Ryan McDonald for advice on the MSTParser experiments and Johan Hall for help with MaltTagger.

No woman is an island and I have been fortunate to be part of several stimulating research environments. I would like to thank the Graduate School of Language Technology (GSLT) for providing top-quality courses and an inspiring setting in which to meet fellow PhD-students and senior researchers and discuss and get feedback. I have benefited immensely from being a part of GSLT. A special thanks to Atelach Alemu and Karin Cavallin for a most memorable trip to Tuscany, Eva Forsbom for discussions on annotation, to Ebba Gustavii, with whom I started exploring dependency parsing, and to Harald Hammarström, Hans Hjelm, Maria Holmqvist, Svetoslav Marinov and all the other PhD-students for all the good times.
In Gothenburg I have had the pleasure of being part of the NLP-unit at the Dept. of Swedish as well as the newly started Center for Language Technology (CLT). I want to express a big thanks to Lars Borin for the work he has spent editing my thesis and for being such a friendly boss, Dimitrios Kokkinakis for being so helpful and letting me use his eminent suite of Swedish NLP-tools, to Rudolf Rydstedt for letting me take up a lot of disk space and for help with photography, to Robert Andersson for all technical assistance and to Dana Dannells, Karin Friberg, Jerker Järborg, Sofie Johansson-Kokkinakis, Leif-Jöran Olsson, Torgny Rasmark, Maria Toporowska-Gronostaj, Karin Warmenius and everyone else at Språkdata for being such a great group of colleagues. At the Dept. of Swedish, I also want to give a special thanks to the members of the OT reading group for inspiring discussions about linguistics.

Moving to Gothenburg from Oslo, I could never have asked for better colleagues, who soon became close friends. Annika Bergström, Ida Larsson and Karin Cavallin, thank you for your endless support and friendship. I want to give a very special thanks to Ida for giving me daily doses of porridge, perfect matters and perspective on thesis-writing, linguistics and life in general.

I want to thank my fabulous friends in Oslo, Madrid and New York for keeping me grounded. Thanks to Randi Nygård for letting me use her lovely drawing on the cover of this book.

I want to extend the warmest thanks possible to my dear family for all the love and support through what has been a life-altering time. And finally, Fredrik, I could have written this thesis without you, but I certainly would not have wanted to.

Thank you all!

Lilja Øvrelid
Gothenburg, April 20th, 2008

CONTENTS

Abstract
Acknowledgements
1 Introduction
1.1 Argument differentiation
1.2 Data-driven models
1.3 Modeling argument differentiation
1.4 Assumptions and scope of the thesis
1.5 Outline of the thesis
I Background
2 Soft constraints
2.1 Frequency
2.1.1 Frequency as linguistic evidence
2.1.2 The mental status of frequency
2.1.3 Frequency and modeling
2.2 Constraints
2.2.1 The status of constraints
2.2.2 Soft constraints
2.3 Incrementality
2.3.1 Ambiguity processing
2.3.2 Constraining interpretation
2.4 Gradience
2.4.1 Grammaticality
2.4.2 Categories
2.5 Conclusion
3 Linguistic dimensions of argument differentiation
3.1 Arguments
3.2 Animacy
3.2.1 Animacy of arguments
3.2.2 Ambiguity resolution
3.2.3 The nature of animacy effects
3.2.4 Gradient animacy
3.3 Definiteness
3.3.1 Definite arguments
3.4 Referentiality
3.4.1 Referentiality and arguments
3.5 Relational properties
3.6 Interaction and generalization
3.6.1 Interaction
3.6.2 A more general property
4 Properties of Scandinavian morphosyntax
4.1 Morphological marking
4.1.1 Case
4.1.2 Definiteness
4.2 Word order
4.2.1 Initial variation
4.2.2 Rigid verb placement
4.2.3 Variable argument placement
4.2.4 More variation
5 Resources
5.1 Corpora
5.1.1 Talbanken05
5.1.2 Parole
5.1.3 The Oslo Corpus
5.2 Machine Learning
5.2.1 Decision trees (C5.0)
5.2.2 Memory-Based Learning (TiMBL)
5.2.3 Clustering (Cluto)
5.3 Parsing
5.3.1 MaltParser
5.3.2 MSTParser
II Lexical Acquisition
6 Acquiring animacy – experimental exploration
6.1 Previous work
6.1.1 Animacy
6.1.2 Verb frames and classes
6.2 Data preliminaries
6.2.1 Language and corpus resource
6.2.2 Noun selection
6.2.3 Features of animacy
6.3 Method viability
6.3.1 Experimental methodology
6.3.2 Experiment 1
6.4 Robustness
6.4.1 Experiment 2: Effect of sparse data on classification
6.4.2 Experiment 3: Back-off features
6.4.3 Experiment 4: Back-off classifiers
6.4.4 Summary
6.5 Machine learning algorithm
6.5.1 Experimental methodology
6.5.2 Experiment 5: High frequency nouns
6.5.3 Experiment 6: Lower frequency nouns
6.5.4 Summary
6.6 Class granularity: classifying organizations
6.6.1 Data
6.6.2 Experiment 7: Granularity
6.6.3 The distribution of organizations
6.6.4 Conclusion
6.7 Unsupervised learning as class exploration
6.7.1 Experiment 8: Clustering
6.8 Summary of main results
7 Acquiring animacy – scaling up
7.1 Obtaining animacy data
7.1.1 Animacy annotation
7.1.2 Person reference in Talbanken05
7.2 Data preliminaries
7.2.1 Talbanken05 nouns
7.2.2 Features
7.2.3 Feature extraction
7.3 Experiments
7.3.1 Experimental methodology
7.3.2 Original features
7.3.3 General feature space
7.3.4 Feature analysis
7.3.5 Error analysis
7.4 Summary of main results
III Parsing
8 Argument disambiguation in data-driven dependency parsing
8.1 Syntactic parsing
8.1.1 Data-driven parsing
8.1.2 Dependency parsing
8.1.3 Data-driven dependency parsing
8.2 Error analysis
8.2.1 A methodology for error analysis
8.2.2 Data
8.2.3 General overview of errors
8.3 Errors in argument assignment
8.3.1 Arguments in Scandinavian
8.3.2 Subject and direct object errors
8.3.3 Formal subject errors
8.3.4 Indirect object errors
8.3.5 Subject predicative errors
8.3.6 Argument and non-argument errors
8.3.7 Head distance
8.4 Setting the scene
9 Parsing with linguistic features
9.1 Linguistic features
9.1.1 Empirical approximations
9.2 Experiments with linguistic features
9.2.1 Experimental methodology
9.2.2 Animacy
9.2.3 Definiteness
9.2.4 Pronoun type
9.2.5 Case
9.2.6 Verbal features
9.2.7 Feature combinations
9.2.8 Selectional restrictions
9.3 Features of the parser
9.3.1 Parser comparison
9.3.2 Feature locality
9.3.3 Features of argument differentiation
9.4 Automatically acquired features
9.4.1 Acquiring the features
9.4.2 Experiments
9.5 Summary of main results
10 Concluding remarks
10.1 Main contributions
10.1.1 Lexical acquisition
10.1.2 Parsing
10.1.3 Argument differentiation
10.2 Future work
References

LIST OF FIGURES

1 The ‘identifiability’ criterion for definiteness and specificity
2 Dependency representation of example from Talbanken05.
3 Dependency representation of example with subordinate clause from Talbanken05.
4 Example feature vectors.
5 Accuracy as a function of absolute noun frequencies for classifiers with all versus individual features.
6 Accuracy as a function of absolute noun frequencies for classifiers with backed-off features.
7 Animacy classification scheme.
8 Rank frequency profile of all Parole nouns.
9 Decision tree acquired for the >100 data set in experiments with a general feature space.
10 Algorithm for automatic feature selection with backward search
11 Baseline feature model for Swedish
12 Head distance in correct versus errors for argument relations
13 Extended feature model for Swedish
14 Total number of SS_OO errors and OO_SS errors in the experiments
15 Dependency representation of example (176)

1 INTRODUCTION

The main goal of syntactic analysis is often bluntly summarized as figuring out “who does what to whom?” in natural language. At the core of this simplification, however, is the idea that central to the understanding of a natural language sentence is the understanding of the predicate-argument structure which it expresses, and, in particular, the syntactic relationship which holds between the predicate and its individual arguments.
The study of the relationship between meaning and form, how the syntactic expression of a certain semantic proposition precisely reflects the meaning which we wish to convey, can be seen to unite current syntactic theories. In the field of computational linguistics, syntactic parsing constitutes a central topic, where the main focus is on the automatic assignment of syntactic structure to natural language. The relation between syntax and semantics is furthermore exploited in work on automatic acquisition of lexical semantics, where the syntactic distribution of an element is seen as indicative of certain semantic properties. In psycholinguistics, the understanding of how we as language users perform this mapping in real-time comprehension has been widely studied. The study of argument differentiation focuses on the distinguishing properties of syntactic arguments which are central to syntactic analysis, whether studied from a theoretical, experimental or computational point of view. This is the central topic of this thesis.

1.1 Argument differentiation

Syntactic arguments express the main participants in an event, hence are intimately linked to the semantics of a sentence. Syntactic arguments also occur in a specific discourse context where they convey linguistic information. For instance, the subject argument often expresses the agent of an action, hence will tend to refer to a human being. Moreover, subjects typically express the topic of the sentence and will tend to be realized by a definite nominal. These types of generalizations regarding the linguistic properties of syntactic arguments express probabilistic, or ‘soft’, constraints, rather than absolute requirements on syntactic structure. In language data, we observe frequency effects in the realization of syntactic arguments and a range of linguistic studies emphasize the correlation between syntactic function and various linguistic properties, such as animacy and definiteness.
These properties are recurring also in cross-linguistic studies where they determine argument differentiation to varying degrees in different languages.

The realization of a predicate-argument structure is furthermore subject to surface-oriented and often language-specific restrictions relating to word order and morphology. In many languages, the structural expression of syntactic arguments exhibits variation. The Scandinavian languages, for instance, are characterized by a rigid verb placement and a certain degree of variation in the positioning of syntactic arguments. Work in syntactic theory which separates the function-argument structure from its structural realization highlights exactly the mediating role of arguments between semantics and morphosyntax. An understanding of the influence of different linguistic factors and their interaction in argument differentiation clearly calls for a principled modeling of soft constraints and the frequency effects which these incur in language data.

Semantic properties of verbs and their relation to syntactic realization have been given much attention both in theoretical and computational linguistic studies. The central status of the predicate as syntactic head, selecting and governing its arguments, is hardly under dispute. However, a focus on linguistic properties of syntactic arguments is important, both from a theoretical and a more practical or applied point of view. The study of properties of arguments and their influence in argument differentiation highlights cross-linguistic tendencies in the relation between syntax and semantics. It furthermore raises theoretically relevant questions regarding the modeling of these insights, the interaction between levels of linguistic analysis and the relation between theoretical results and practical applications.

1.2 Data-driven models

Recent decades have witnessed an empirical shift in the field of computational linguistics.
New types and quantities of data have enabled new types of generalizations, and empirical, data-driven models are by now widely used. A defining property of these models is found in the systematic combination and weighting of different sources of evidence. In the processing of natural language, the ability to generalize over complex interrelationships has provided impressive results for a range of different NLP tasks.

A central theorem in machine learning theory emphasizes the fact that all learning requires a bias, that is, the learning problem must be defined in such a way as to make generalization possible. Different machine learning algorithms come with different biases and an understanding of the way in which the search for the most likely hypothesis is performed is important in order to understand the results. Moreover, in order for learning to take place, the input data must be represented in such a way as to capture useful distinctions. The selection of features employed in the representation of the training data can have dramatic effects on results.

There exists a pronounced interest in a deeper understanding of the results obtained using data-driven methods and how these relate to generalizations from more theoretically oriented work. Empirical methods have gained momentum also in theoretical linguistics in recent years, where important insights revolve around the role and theoretical interpretation of language data and the modeling thereof. The exchange of insights and results constitutes an important step for further advancement of the study of natural language processing and linguistics in general. It is clear, however, that such an understanding requires an understanding of the data-driven models themselves as well as the implications of various representational choices.
In the modeling of natural language, it is certainly not always the case that the most linguistically informed system is also the best performing system. Data-driven models, largely being probabilistic, furthermore have a reputation for being chaotic and difficult to interpret. In this respect, theoretically motivated hypotheses regarding linguistic analysis may provide a clarifying perspective.

1.3 Modeling argument differentiation

In this thesis, we propose that argument differentiation should be studied using data-driven methods which highlight the direct relationship between frequency distributions in language data and linguistic categories. The commitment is strictly empirical in that we will not explicitly formulate a set of constraints or a grammar for the interpretation of syntactic arguments. Rather, the focus will be on an explicit formulation of a learning bias in terms of linguistically motivated features, and on the evaluation of these features. We will investigate the linguistic factors involved in argument differentiation from two different perspectives, each highlighting different aspects of syntactic argumenthood and the relation between linguistic theory and model.

Animacy is a linguistic property which has been claimed to be an important factor in argument differentiation both in cross-linguistic studies and in psycholinguistic work. If this assumption is correct, we may hypothesize that differentiated arguments should provide important clues with respect to the property of animacy. In this thesis, we will investigate lexical acquisition of animacy information based on syntactic, distributional features. By generalizing over the syntactic distribution of individual noun tokens, we may study linguistic properties of syntactic arguments irrespective of their specific realization in a particular sentence. In this way we may capture empirical frequency effects in the mapping between syntax and semantics.
Through the application and evaluation of data-driven machine learning methods, we will investigate theoretical claims regarding the relationship between syntactic arguments and the property of animacy, as well as the robustness and reliability of such correlations. The focus is thus on the relation of syntactic arguments to lexical semantics, and the types of generalizations which can be obtained under current distributional approaches to computational semantics.

The more abstract task of argument differentiation can be directly linked to the practical task of automatic syntactic parsing. We propose that the task of argument disambiguation in a data-driven system provides us with a setting where the effect of various linguistic properties may be tested, and their interaction studied experimentally. In this respect, the property of being data-driven, as opposed to grammar-driven, allows for argument differentiation to be directly acquired through frequency of language use and with minimal theoretical assumptions. It enables an investigation of the relation of syntactic arguments to semantic interpretation, as well as to explicit, formal marking such as case and word order. Moreover, we may investigate whether the task of argument disambiguation can be improved by theoretically informed features and error analysis.

The overall research questions addressed in this thesis may be formulated as follows:

1. How are syntactic arguments differentiated?
   • Which linguistic properties differentiate arguments?
   • How do linguistic properties interact to differentiate an argument?
2. How may we capture argument differentiation in data-driven models of language? What are the effects?

The two main questions posed above are addressed throughout this thesis and can be viewed as constituting the central motivation behind the work presented here.
Following from these, several more specific research questions will be posed during the course of the thesis which serve to further elucidate the topic of argument differentiation and its data-driven modeling.

1.4 Assumptions and scope of the thesis

The main languages in focus in this thesis are Scandinavian type languages, exemplified primarily by Swedish and Norwegian. The phenomena studied are not, however, limited to Swedish or Norwegian and we provide examples from a range of languages. The Scandinavian type languages exhibit some properties which make them interestingly different from English, while still being similar enough to warrant comparison. The case of argument differentiation touches upon issues that are relevant for several other languages and on methodological and theoretical issues which are of interest to linguists and computational linguists alike.

We aim throughout the thesis at a fairly theory-neutral investigation of arguments and argument differentiation. However, due to the nature of the problems which the thesis addresses, a certain bias will be present in the theories which are most readily used for exemplification and comparison. These will include lexicalist theories, due to the link to lexical semantics, and non-modular theories, due to the mixed nature of the constraints taken from the syntax-semantics interface.

1.5 Outline of the thesis

The thesis is organized into three parts, where the two central parts, Part II and III, are largely independent and may be read separately.

Part I: Background provides the relevant background by introducing the theoretical terminology, as well as models and resources employed in the ensuing parts of the thesis.

Chapter 2: Soft constraints addresses notions of soft, probabilistic constraints in linguistic theory. We discuss the role of frequency in the study of language and introduce the notion of soft, probabilistic constraints on language.
The effect of incrementality on linguistic generalizations further leads us to the notion of linguistic ambiguity which is central to computational language processing, and syntactic parsing in particular. Finally, we discuss the notion of gradience and, more specifically, gradience in linguistic categories.

Chapter 3: Linguistic dimensions of argument differentiation starts out by introducing the notion of argumenthood in linguistics, as well as establishing a set of central distinctions within the group of arguments. We further introduce linguistic properties which have been proposed to differentiate syntactic arguments, in particular the property of animacy, as well as definiteness and referentiality. We present evidence from linguistic studies providing cross-linguistic, as well as psycholinguistic and empirical support for the role of these properties in argument differentiation.

Chapter 4: Properties of Scandinavian morphosyntax describes some relevant properties of the Scandinavian languages, with a particular focus on the morphological and structural expression of syntactic arguments.

Chapter 5: Resources describes the corpora and resources employed for machine learning and parsing in the following two parts of the thesis. We provide a brief introduction to dependency representations, which will be central in Part III of the thesis. We also discuss some important distinctions in machine learning of linguistic data and present decision tree learning, memory-based learning and clustering.

Part II: Lexical Acquisition concerns lexical acquisition of animacy information, with focus on the task of animacy classification. We briefly introduce the area of lexical acquisition and previous work which has focused on the relation between syntax and semantics.
Chapter 6: Acquiring animacy – experimental exploration presents a detailed study of animacy classification which investigates theoretical and practical issues including a definition of the learning task, feature selection and extraction, results, robustness to data sparseness and implications for the choice of machine learning algorithm.

Chapter 7: Acquiring animacy – scaling up deals with the scaling up of lexical acquisition of animacy information. We discuss schemes for animacy annotation and our requirements on such annotation. We experiment with a generalization of the results from chapter 6 in the application of animacy classification to a new data set in a different, although closely related, language. We discuss issues of data representation, data sparsity, class distribution and machine learning algorithm further and provide a quantitative evaluation of the method, as well as in-depth feature and error analysis.

Part III: Parsing presents experiments in argument disambiguation, with a focus on linguistic features relating to argument differentiation. We introduce data-driven dependency parsing and motivate its use in the study of argument differentiation.

Chapter 8: Argument disambiguation in data-driven dependency parsing starts out by defining a methodology for error analysis of parse results. We proceed to apply the methodology to a baseline parser for Swedish. We discuss the types of generalizations which are acquired regarding syntactic arguments and furthermore relate the errors to properties of argument expression in Scandinavian type languages.

Chapter 9: Parsing with linguistic features investigates the effect of theoretically motivated linguistic features on the analysis of syntactic arguments. We present a range of experiments evaluating the effect of different linguistic dimensions in terms of overall parse results, as well as on argument disambiguation in particular.
We furthermore evaluate the effect of different parser properties on the results and discuss scalability in terms of parsing with automatically acquired features.

Chapter 10: Concluding remarks concludes the thesis by outlining its main contributions and directions for future work.

Part I Background

2 SOFT CONSTRAINTS

The surge of empiricism characterising the last decades in the field of computational linguistics has also influenced the field of theoretical linguistics. The availability of large corpora and fairly good automatic annotation thereof provides the possibility to make new types of generalizations about language and language use. Dealing with real language with all its imperfections and massive variation has sparked an interest in more empirically motivated methods and models also within theoretical linguistics. In particular, the strict competence-performance dichotomy has been called into question. The main concern is that the traditional categorical distinctions are unsatisfactory in their coverage: “there is a growing interest in the relatively unexplored gradient middle ground, and a growing realization that concentrating on the extremes of continua leaves half the phenomena unexplored and unexplained” (Bod, Hay and Jannedy 2003: 1).

Based on work in computational, theoretical and experimental linguistics, this chapter discusses a discernible shift in the view of human language and the modeling thereof. In particular, this shift is characterized by an acknowledgement that bridging the divide between studies of competence and studies of performance can be fruitful in unifying insights obtained in the various subfields of linguistics. Empirical investigations of language rely on the use of new types of data, in particular frequency of language use. The modeling of these results expresses probabilistic grammars of soft constraints on linguistic structure.
The role of constraints in language processing and, in particular, the notion of incrementality raise further questions about the nature of constraints and their interaction. A probabilistic view of language furthermore entails gradience of grammaticality, as well as of linguistic categories in general.

2.1 Frequency

The data-driven methods prevalent in current computational linguistics rely to a large extent on statistical modeling where frequency of usage is employed to approximate probabilities. An interesting question is whether frequency in language and modeling thereof expresses generalizations of interest to more theoretically oriented linguists as well. Frequency has first and foremost been viewed as a property of performance or language use and frequency effects are found within all areas of linguistic realization. In the following we examine the role of frequency in linguistic theory, with particular focus on frequency as theoretical data, its role in language processing and in modeling of practical, theoretical and experimental results.

2.1.1 Frequency as linguistic evidence

The view of what constitutes linguistic evidence is one distinguishing factor between largely rationalist and empiricist approaches to the study of human language. The rationalist view of linguistic theory, with inspiration taken from the natural sciences, sees the main task as the modeling of our internal linguistic knowledge, or competence, and introspection is considered sufficient evidence to this end. Strictly empiricist approaches, on the other hand, consider real language data to be paramount and the primary object of study, not necessarily attempting generalization across data sets. Within the area of corpus linguistics, the study of linguistic phenomena is synonymous with the study of frequency distributions in language use and corpus data is widely employed within a range of sub-disciplines of linguistics, e.g.
lexicography, sociolinguistics, spoken language etc. (McEnery and Wilson 1996). This empiricist focus on properties of naturally occurring data has been viewed as irreconcilable with the rationalist goals. The strict division between rationalism and empiricism is admittedly an oversimplification. Most current day linguists employ both kinds of data in their theoretical and/or descriptive work. However, the extent to which properties observed in the data form part of a comprehensive model with testable consequences is not always explicitly clear.

Recent syntactic work within Optimality Theory (OT)¹ has exploited a gradient notion of markedness expressed through a set of ranked, universal constraints and has promoted the idea that “soft constraints mirror hard constraints” (Bresnan, Dingare and Manning 2001: 1); linguistic generalizations which incur categorical effects in some languages show up as strong statistical tendencies in other languages. This certainly calls the competence-performance dichotomy into question and, in particular, the consequence that the very same generalizations should form part of linguistic competence for the speakers of one language but be considered mere performance effects in another. The proposal that a probabilistic grammar might be an alternative which provides a comprehensive model of these facts and thus cuts across the traditional competence-performance divide has emerged.

¹ See the introductory sections in Kager 1999 for an introduction to the main tenets of OT.

The idea that some linguistic generalizations are reducible to frequency of use is not new. The work within OT mentioned above has adopted from functional and typological work the notion of markedness, which is based on “asymmetrical or unequal grammatical properties of otherwise equal linguistic elements” (Croft 2003: 87), where the more unmarked an element is, the more natural and typical it is.
Frequency is clearly related to the notion of markedness and often figures as a criterion for this distinction (Croft 1990). It has been argued, however, that this notion of markedness may simply be reduced to differential frequency of language use (Haspelmath 2006). Rather than introducing the additional notion of markedness to account for these frequency effects, we should refer directly to frequency as the determining factor.²

Frequency as the central explaining factor is found in largely non-generative, usage-based accounts (Barlow and Kemmer 2000; Bybee and Hopper 2001), where the key role of frequency is linked to linguistic induction or learning. Starting from the same generalization that phenomena are frequent to varying degrees in different languages and calling the competence-performance distinction into question, we see that it is possible to arrive at an alternative conclusion, namely that it is all performance.

In general we can see that the role of frequency effects in language raises the issue of the balance between learning and innateness, i.e. how much of our linguistic knowledge is acquired and how much is innate? In this respect we may view the mainstream generative paradigm and the usage-based approaches mentioned above as representing extreme oppositions. Recent work discussing the theoretical implications of data-driven models highlights the use of machine learning to assess hypotheses regarding language acquisition and the so-called ‘poverty of the stimulus’ argument for innateness (Lappin and Shieber 2007). Investigations into the relationship between syntactic structure and lexical semantics, and, in particular, verbal semantic classes, have furthermore highlighted the use of machine learning methods over frequency distributions in language to test linguistic hypotheses (Merlo and Stevenson 2004).
² The type of markedness argument certainly has an air of circularity: an element is unmarked because it is frequent and frequent because it is unmarked.

2.1.2 The mental status of frequency

Within psycholinguistics it has long been recognized that frequency plays a key role in human language processing and, furthermore, it is largely believed that language processing is probabilistic (Jurafsky 2003). Frequency has been shown to be an important factor in several areas of language comprehension (Jurafsky 2003):³

Access: Frequent lexical items are accessed, hence processed, faster.
Disambiguation: The frequency of various interpretations influences processing of ambiguity.
Processing difficulty: Low-frequency interpretations cause processing difficulties.

These frequency effects are mostly connected to lexical form, i.e., word form or category, or lexical semantics. For instance, it has been shown that frequent words are processed faster. With respect to lexical ambiguities, studies indicate that use of the most frequent morphological category or most frequent sense of a lexeme stands in a direct relation to processing time. With respect to structural ambiguities in language comprehension, subcategorization frame probabilities have been related to parsing difficulties in notorious garden-path sentences, such as, e.g., The horse raced past the barn fell, see section 2.3.1.

Efforts to link results from empirically oriented, theoretical work with psycholinguistic evidence have highlighted the role of frequency also in production, in particular with respect to variation or syntactic choice. Bresnan (2006) presents results from forced continuation experiments on the dative alternation and argues that the same set of soft, probabilistic, constraints which were shown to correlate with the choice of dative construction in corpus studies (Bresnan and Nikitina 2007; Bresnan et al. 2005) are also active in the judgements of language users.
This indicates that language users have detailed knowledge of the interaction of constraints and Bresnan (2006) concludes, somewhat controversially, that syntactic knowledge is in fact probabilistic in nature.

³ Jurafsky (2003) reasons that these phenomena are influenced by probability and goes on to present evidence from experiments showing the effect of raw frequencies or conditional probabilities estimated by frequencies.

2.1.3 Frequency and modeling

Frequency effects in language lend themselves readily to probabilistic modeling and provide empirical estimates for probabilistic model parameters. In computational linguistics, probabilistic modeling based on language frequencies has permeated practically all areas of analysis.⁴ Stochastic models, such as Hidden Markov models (HMMs) and Bayesian classifiers, have been widely employed in word-based tasks such as part-of-speech tagging and word sense disambiguation. In parsing, probabilistic extensions of classical grammar formalisms, such as probabilistic context-free grammars (PCFGs) (Charniak 1996) and the lexicalized successors in various incarnations (Collins 1996; Charniak 1997; Bikel 2004), have dominated the constituent-based approaches to parsing. Central to this development has been the use of syntactically annotated corpora, or treebanks (Abeillé 2003) and parameter estimation from treebanks.⁵ The use of statistical inference in induction of information from corpus data constitutes an integral part of most NLP systems, recasting a range of complex problems, such as named-entity tagging (Tjong Kim Sang 2002b), phrase detection/chunking (Tjong Kim Sang and Buchholz 2000), parsing (Buchholz and Marsi 2006; Nivre et al. 2007) and semantic role labeling (Carreras and Màrquez 2005) as classification problems.

Probabilistic models have also been widely employed to model human language processing.
The primary concern is that these models should provide realistic approximations of the language processing task and, in particular, be predictive of the types of processing effects indicated by experimental results. For the processing of lexical ambiguities, HMMs have been employed and syntactic ambiguities have been modeled employing probabilistic extensions of grammars, such as probabilistic context-free grammars (PCFGs). The processing difficulties observed in conjunction with the garden-path sentences mentioned above, so-called ‘reanalysis’, can then be directly related to the presence of an additional rule with a small probability in the reanalysis. Furthermore, within the area of language acquisition, probabilistic modeling is common and the learning problem can be formulated as acquisition of a set of weighted constraints through exposure to linguistic data, expressing a connectionist, functionalist view of language (see, e.g., Seidenberg and MacDonald 1989).

⁴ See Manning and Schütze 1999 for an overview.

⁵ Note however that lexicalized parsers necessarily rely on advanced techniques for smoothing of sparse data, hence maximum likelihood estimation is not sufficient for parameter estimation. One common technique is to markovize the rules (Collins 1999; Klein and Manning 2003).

Within theoretical linguistics, the probabilistic modeling of frequencies has mostly been descriptive, for instance in testing statistical significance of distributional differences. To a certain extent, probabilistic models have also been employed to test the strength of various correlations by means of logistic regression models in particular (see, e.g., Bresnan et al. 2005; Rahkonen 2006; Bouma 2008).
Probabilistic models also provide a method for modeling the interaction of probabilities over syntactic structure without necessarily demanding a rebuttal of the tools of formal syntactic models and frameworks developed over a long period of time. A simple example is a probabilistic context-free grammar which conditions the probability of a sentence on the probabilities of its subtrees. However, more sophisticated theories of syntax based on a notion of probability have also been proposed (Bod 1998). In theories where grammatical generalizations are expressed as constraints on structure, these constraints may themselves be associated with probabilities (or ‘weights’) and their interaction modeled using probabilistic models. Within the framework of Optimality Theory there has been a substantial amount of work in recent years on probabilistic formulations of constraint interaction.

2.2 Constraints

Generally speaking, a constraint restricts a solution, usually by providing a condition which must be fulfilled. Constraint-based theories are central in the theoretical and psycholinguistic modeling of syntactic structure. However, properties of the constraints employed differ in a way that corresponds with the object of study and the data employed to do so. In theoretical linguistics, the constraints are generally assumed to be absolute and based on strict grammaticality judgements, whereas experimental results indicate the use of probabilistic constraints in human language processing. Recent work in theoretical linguistics, however, opens the way for a reconsideration of properties of constraints as a reflection of linguistic knowledge.

2.2.1 The status of constraints

Within the discourse of syntactic formalisms, the term ‘constraint’ has been widely used.
Constraint-based theories such as Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag 1994; Sag, Wasow and Bender 2003) and Lexical Functional Grammar (LFG) (Kaplan and Bresnan 1982; Bresnan 2001) are often contrasted with derivational theories, such as Government and Binding (Chomsky 1981) and Minimalism (Chomsky 1995). One of the main differences between the two is situated in the view of syntactic structure as constructed or ultimately constrained. Central to a notion of constraint-based syntax is the idea that constraints limit the number of possible grammatical structures in a way that corresponds to the system modeled, namely our linguistic competence. The constraint-based theories place much of the constraining power in the lexicon, where constraints in lexical entries restrict the possible combinatory space in syntactic structure. In much the same way that derivational theories associate restrictions in terms of structural positions along with movement, constraint satisfaction in constraint-based theories is assured by means of unification. The constraints are absolute in the sense that they impose requirements on structure which must be fulfilled.

Optimality Theory (OT) operates with a somewhat different view of constraints. Here the constraints are violable, or ‘soft’, but strictly ranked with respect to each other and a violation of a constraint is possible only to fulfil a constraint that is higher in rank. The interaction of constraints in a ranking is therefore the key to understanding the difference between the two notions of constraints. The principal notion of a constraint as a “structural requirement that may be either satisfied or violated by an output form” (Kager 1999: 9) is thus not shared by the two directions outlined above, since constraint violation excludes any output form in the constraint-based theories.
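
The evaluation of violable but strictly ranked constraints described above can be illustrated with a small sketch. The constraint names, candidate outputs and violation counts are hypothetical and chosen only for illustration; comparing violation profiles lexicographically in ranking order is one simple way of expressing strict ranking, and is not an implementation taken from the works cited here.

```python
from typing import Dict, List, Tuple

# Hypothetical ranking, highest-ranked constraint first (names are illustrative only).
RANKING: List[str] = ["C1", "C2", "C3"]

# Hypothetical candidate outputs for one input, with violation counts per constraint.
CANDIDATES: Dict[str, Dict[str, int]] = {
    "output_a": {"C1": 0, "C2": 2, "C3": 0},
    "output_b": {"C1": 1, "C2": 0, "C3": 0},
    "output_c": {"C1": 0, "C2": 1, "C3": 3},
}


def violation_profile(violations: Dict[str, int], ranking: List[str]) -> Tuple[int, ...]:
    """Order a candidate's violation counts by constraint rank."""
    return tuple(violations.get(constraint, 0) for constraint in ranking)


def optimal(candidates: Dict[str, Dict[str, int]], ranking: List[str]) -> str:
    """Return the candidate whose profile is lexicographically smallest.

    Tuples compare position by position, so a single violation of a
    higher-ranked constraint outweighs any number of lower-ranked violations.
    """
    return min(candidates, key=lambda name: violation_profile(candidates[name], ranking))


winner = optimal(CANDIDATES, RANKING)
reranked_winner = optimal(CANDIDATES, ["C2", "C1", "C3"])
print(winner, reranked_winner)
```

Under the first ranking, output_c is selected despite its three violations of the lowest-ranked constraint, since it fares better than its competitors on the higher-ranked ones; reranking the same constraints selects a different output, which is how a system of this kind expresses differences between grammars.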
The effect of the constraints on linguistic structure, whether absolute or ranked and violable, however, is common to both of the types of constraint-based theories outlined above. In OT-terms, there is only one output for any given input – both the constraint-based theories and OT operate with a categorical notion of grammaticality. It does not make sense within these theories to speak about varying acceptability of different constructions or outputs.

2.2.2 Soft constraints

In contrast to the view of constraints presented above, recent work within Optimality Theory has focused on the use of soft, in the sense ‘probabilistic’ or ‘weighted’, constraints. In line with the shift towards empirical methods in computational linguistics, focus on the relationship between language data and (OT) grammars has resulted in work on acquisition of constraint rankings from corpus data. Constraints in an OT grammar are ranked in a hierarchy of dominance, related through strict domination (Kager 1999: 22):

    Strict domination: Violation of higher ranked constraints cannot be compensated for by satisfaction of lower-ranked constraints.

It follows from the above definition that i) constraint ranking is strict, not variable and ii) constraint violations are non-cumulative. The work on soft, weighted constraints in OT challenges both of these entailments.

Soft constraints were initially introduced in OT to model linguistic variation (Boersma and Hayes 2001; Goldwater and Johnson 2003), but have also been applied to syntactic variation (Bresnan and Nikitina 2007; Bresnan, Dingare and Manning 2001; Øvrelid 2004). In order to account for more than one possible output for a given input, i.e., linguistic variation, constraints may be defined over a continuous scale, where the distance between the constraints is proportional to their fixedness in rank.
The ranks or weights of constraints are acquired from language data and thus reflect the frequency distributions found in the data.⁶ Goldwater and Johnson (2003) make use of a Maximum Entropy model to learn constraint weights and model constraint interaction.

The use of a Maximum Entropy model for modeling constraint interaction brings us to the second entailment above, namely the issue of cumulativity. It is one of the main tenets of OT that no number of violations of a lower ranked constraint can cancel out a violation of a higher ranked constraint. This is not, however, a property of most probabilistic models where cost computations often are additive. Jäger and Rosenbach (2006) discuss models for variation in OT and put forward empirical evidence for cumulativity in the syntactic variation of the English genitive alternation. The alternation is viewed as probabilistic variation, and statistical tendencies in language data are employed as evidence. A distinction between soft and hard constraints has furthermore been introduced in modeling of experimental judgement data, where these are proposed to differ in the observable effect that their violations incur on the relative acceptability of a sentence (Keller 2000).⁷

We thus observe two notions of ‘soft constraint’ emerging in recent discourse, where the main difference between the two is found in constraint interaction:

Standard OT: Constraints are soft in the sense that they may be violated and are strictly ranked. This is the standard sense of a soft constraint which distinguishes between the view of constraints within OT and other constraint-based theories.⁸

Probabilistic OT: Constraint interaction is furthermore probabilistic, in the sense that
• constraints are weighted,
• constraint interaction is stochastic (not strictly ranked),
• constraint interaction is (possibly) cumulative.

Probabilistic OT is thus an extension of Standard OT.

⁶ We may note, however, that the modeling of linguistic variation does not necessarily demand the introduction of probabilistic constraints, although, within an OT setting, it does entail relaxation of the demand for strict ranking. Proposals have been made that employ unranked constraints which are nevertheless still ordinal, as in standard OT (Anttila 1997). Furthermore, the introduction of probabilistic constraints does not necessitate variable ranking. A categorical OT system with a strict ranking of constraints within a probabilistic setting simply constitutes an extreme where all constraints are ranked so far apart as to be non-interacting.

⁷ Keller (2000) proposes a version of Optimality Theory, Linear Optimality Theory (LOT), where constraints come in two flavours – soft and hard. The weighting of constraints in LOT models numerical acceptability data from Magnitude Estimation experiments (Bard, Robertson and Sorace 1996). Unlike the work discussed above, however, Keller argues that the status of a constraint as soft/hard is not susceptible to cross-linguistic variation; if a constraint is soft in one language, it is soft in another too. So rather than allowing for the soft/hard distinction to follow directly from the weighting of constraints, it is stipulated independently as a universal property of the constraints.

⁸ We may note, however, that OT and constraint-based theories like HPSG and LFG should not be viewed as competitors due to the fact that they operate on different levels. OT is a theory of constraint interaction and not representation and is fully compatible with other representational theories, see for instance work on OT-LFG (Choi 2001; Kuhn 2001).

We may note that a very similar development can be found in work on automatic, syntactic parsing.
As an equivalent to the hard notion of constraints discussed above, a line of work in dependency parsing proposes disambiguation by boolean constraints taken from various linguistic levels of analysis through constraint propagation in a constraint network (Maruyama 1990). Extensions of Maruyama’s approach have included a notion of soft, weighted constraints (Schröder 2002) and some work has also been done on machine learning of grammar weights for these hand-crafted constraints (Schröder et al. 2001). Parsing with a set of weighted constraints, where hard constraints are simply constraints located at the extreme end of the scale, recasts the parsing problem as an optimization problem, i.e. locating the best of all possible solutions which maximizes/minimizes a certain scoring function. The parallel to the constraint interaction proposed in OT is obvious when parsing is modeled as an optimization problem where the search space consists of all possible linguistic analyses (Buch-Kromann 2006).

2.3 Incrementality

Human language processing and modeling thereof is characterized by incrementality; data is presented bit by bit, hence analyses are necessarily based on incomplete evidence. Probabilistic models are typically employed in modeling, providing a model of decision making under uncertainty and based on incomplete evidence. Effects of incremental language processing have typically been attributed to performance, along with extra-linguistic factors such as memory load. However, the interest in probabilistic grammars as discussed above opens the way for a reevaluation of the competence-performance distinction and its bearing on linguistic theory building:

    We believe not only that grammatical theorists should be interested in performance modeling, but also that empirical facts about various aspects of performance can and should inform the development of the theory of linguistic competence.
    That is, compatibility with performance models should bear on the design of competence grammars. (Sag and Wasow 2008: 2)

In the following we discuss processing of ambiguity, a problem which has been widely studied in theoretical, computational and experimental linguistics, and hence may be employed to illustrate the demands of incrementality on the nature of constraints and constraint interaction.

2.3.1 Ambiguity processing

Ambiguity is a property which is characteristic of natural language, distinguishing it from formal languages. It consists of a mismatch in the mapping between form and meaning, where one form corresponds to several meanings (Beaver and Lee 2004). Ambiguities in natural language have been widely studied within theoretical linguistics, psycholinguistics and computational linguistics. It is a notorious problem within NLP, in particular within the areas of part-of-speech tagging, syntactic parsing and word sense disambiguation. Ambiguity is seen as one of the main reasons “why NLP is difficult” (Manning and Schütze 1999: 17) and is prevalent at all levels of linguistic analysis. In psycholinguistics, ambiguities have been claimed to increase processing difficulty (Frazier 1985) and the study of ambiguity processing has been performed under the assumption that it can be indicative of the underlying architecture and mechanisms of the human language faculty.

2.3.1.1 Types of ambiguity

As mentioned, ambiguity is found at all levels of linguistic analysis, ranging from the level of morphemes, so-called syncretism, to semantic and pragmatic ambiguities. Ambiguity with respect to syntactic arguments is, however, in a majority of cases caused by ambiguity in lexical form or in the syntactic environment.⁹

Lexical ambiguities are ambiguities associated with lexical units which have more than one interpretation or meaning. These types of ambiguities are extremely common, and especially frequent words tend to be polysemous.
Categorial ambiguity is found where a word has several meanings, each associated with a distinct category or word class. For instance, time is both a noun and a verb. Function words are notoriously ambiguous, e.g. to may be both an infinitival marker and a preposition and that may be a determiner, a demonstrative pronoun and a complementizer (Wasow, Perfors and Beaver 2005). Categorial ambiguity has syntactic consequences since the category of a lexical item clearly influences its syntactic behaviour. The example in (1) illustrates the polysemy of the English noun case, and (2) the categorial ambiguity of strikes and idle, which both can be used as a verb, as well as a noun or adjective (Mihalcea 2006):

(1) Drunk gets nine years in violin case
(2) Teacher strikes idle kids

⁹ In section 4.1 we examine examples of syncretism in morphological case marking, which directly contribute to functional ambiguity.

Structural ambiguities are found when a sentence may be assigned more than one structure. These include PP-attachment ambiguities, as in (3), coordination ambiguities, as in (4) and noun phrase bracketing ambiguities, as in (5):

(3) The prime minister hit the journalist with a pen
(4) Choose between peas and onions or carrots with the steak
(5) He is a Danish linguistics teacher

2.3.1.2 Global and local ambiguity

Orthogonal to the types of ambiguity discussed above, and hence regardless of the source of ambiguity, we may distinguish between global and local ambiguity. In the processing of ambiguity in language, and with reference to a sentence, local ambiguity obtains when part of a sentence is ambiguous, whereas global ambiguity is found when the whole sentence is ambiguous, cf. (3)-(5) above.
Since human language processing is incremental in nature, local ambiguities can cause processing difficulties, for instance in so-called garden path sentences:

(6) I knew the solution to the problem was correct

A garden-path effect is observed when interpretation changes during the incremental exposure to a sentence. In (6), the postverbal argument is initially interpreted as an object, but must be reanalyzed as subject of a complement clause when the second verb is encountered.

2.3.1.3 Ambiguity resolution

Disambiguation is the process of resolving ambiguities and within NLP many tasks involve disambiguation in some form. Word sense disambiguation, for instance, is solely devoted to the resolution of lexical ambiguities, whereas part-of-speech tagging deals with the subclass of categorial ambiguities. In syntactic parsing, disambiguation is a crucial task which is dealt with in a variety of ways. Irrespective of the particular approach to parsing, disambiguation can be defined as a “process of reducing the number of analyses assigned to a string” (Nivre 2006: 23). In most current approaches to parsing this is achieved by assigning probabilities to the syntactic structure(s), approximated by frequency data from language use. Disambiguation is then performed either as a post-processing step over the total of analyses, or as an integral part of the parsing process itself, often in combination with deterministic processing.

The processing of ambiguity has been studied extensively in psycholinguistic experiments and has been argued to provide evidence for the mechanisms of the human language processor. Important topics in this respect have been the role of frequency in lexical ambiguity resolution and the role of various types of linguistic information in the processing of structural ambiguities.
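
The frequency-based disambiguation used in parsing, as described earlier in this section, can be sketched with a toy probabilistic context-free grammar. The rule counts below are invented for illustration and are not drawn from any treebank; rule probabilities are estimated by relative frequency, and each analysis is scored as the product of the probabilities of the rules it uses. The two analyses correspond to the two attachments of the prepositional phrase in example (3), with lexical rules left out since they are shared by both.

```python
from collections import defaultdict
from math import prod
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...]]

# Invented rule counts standing in for treebank frequencies (illustrative only).
RULE_COUNTS: Dict[Rule, int] = {
    ("VP", ("V", "NP")): 60,
    ("VP", ("V", "NP", "PP")): 30,
    ("VP", ("V",)): 10,
    ("NP", ("Det", "N")): 70,
    ("NP", ("NP", "PP")): 30,
    ("PP", ("P", "NP")): 100,
}


def estimate_probabilities(counts: Dict[Rule, int]) -> Dict[Rule, float]:
    """Relative-frequency estimate: count(rule) / count(all rules with the same left-hand side)."""
    totals: Dict[str, int] = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


def analysis_probability(rules: List[Rule], probs: Dict[Rule, float]) -> float:
    """Probability of an analysis as the product of its rule probabilities."""
    return prod(probs[rule] for rule in rules)


PROBS = estimate_probabilities(RULE_COUNTS)

# Two analyses of "hit the journalist with a pen", given as lists of phrasal rules.
ANALYSES: Dict[str, List[Rule]] = {
    # The PP attaches to the verb phrase (the pen is the instrument).
    "verb_attachment": [
        ("VP", ("V", "NP", "PP")),
        ("NP", ("Det", "N")),
        ("PP", ("P", "NP")),
        ("NP", ("Det", "N")),
    ],
    # The PP attaches to the object noun phrase (the journalist has the pen).
    "noun_attachment": [
        ("VP", ("V", "NP")),
        ("NP", ("NP", "PP")),
        ("NP", ("Det", "N")),
        ("PP", ("P", "NP")),
        ("NP", ("Det", "N")),
    ],
}

SCORES = {name: analysis_probability(rules, PROBS) for name, rules in ANALYSES.items()}
best = max(SCORES, key=SCORES.get)
print(SCORES, best)
```

With these invented counts the verb attachment receives the higher probability, so it would be the analysis retained; different counts would of course shift the preference, which is the sense in which such a model reflects frequencies in language use.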
In a seminal article, MacDonald, Pearlmutter and Seidenberg (1994) propose that resolution of lexical and structural ambiguities, contrary to earlier assumptions, follows the same types of strategies. In particular, language processing can be viewed as a constraint satisfaction problem, where interpretation is constrained by a set of largely lexical, probabilistic constraints. Needless to say, frequency plays an important role in ambiguity resolution in such a model.

2.3.2 Constraining interpretation

We have earlier discussed how frequency effects can affect sentence comprehension, as well as how language-specific frequency effects, typically assigned to the realm of performance, have been claimed to provide evidence for probabilistic grammars of universal competence-oriented constraints. The study of language comprehension raises further questions regarding properties of a comprehensive model of grammar, unifying insights from the study of competence and performance alike.

Results from psycholinguistics suggest several properties that are relevant for grammatical constraints to be “performance-compatible” (Sag and Wasow 2008):

Surface oriented: Processing deals with “what is actually there”.
Non-modular: Information from all linguistic levels should interact.
Lexical: Individual words should carry information on their combinatory potential, as well as their semantic interpretation.

With respect to the theories of constraints discussed in section 2.2 above, we find that both the constraint-based theories employing absolute constraints, and OT, which uses violable ranked constraints, are compatible with these demands.
LFG and HPSG, being theories of representation, are explicit lexicalist theories, whereas all three are non-modular in not placing any restrictions on the type of information which may interact in parallel.10

One might also take the integration of performance-compatible constraints further and suggest that not only should a grammatical model of competence be compatible with the processing of performance data, but it should in fact be one and the same model (Bod 1995, 1998). An important property is then found in the ability to provide an analysis for sentence fragments and a main concern is that incrementality is incompatible with a categorical notion of grammaticality, at least one that is defined by hard, global constraints over complete sentences. OT provides one possible approach for such a model (Stevenson and Smolensky 2005; de Hoop and Lamers 2006), due to the fact that constraints under this approach are violable and therefore provide an analysis for any input, including sentence fragments.

2.4 Gradience

Gradience is employed to refer to a range of continuous phenomena in language, ranging from morphological and syntactic categories to phonetic sounds. The idea that the language system is non-categorical has been promoted within several subdisciplines of linguistics – phonology, sociolinguistics, typology – and gradient categories have been examined at all levels of linguistic representation (Bod, Hay and Jannedy 2003).

2.4.1 Grammaticality

We have discussed the implications of a probabilistic grammar expressed in terms of constraints on linguistic structure. One implication of such a view is a gradient notion of grammaticality.

10 These theories are ‘lexicalist’ in the sense that they place much of the explanatory burden in the lexicon, i.e. the lexical entries contain a majority of the information needed to interpret a sentence.
They are also lexicalist in the sense that they adhere to the principle of Lexical Integrity (Bresnan 2001); words are the smallest units of syntactic analysis and the formation of words is subject to principles separate from those governing syntactic structures.

Whereas ‘degrees of grammaticalness’ (Chomsky 1965, 1975) has played a certain role in generative theoretical work, there has been no systematic incorporation of such notions in the proposed grammatical models. Manning (2003) argues for the use of probabilistic models to explain language structure and motivates his claims by the following observation:

Categorical linguistic theories claim too much. They place a hard categorical boundary of grammaticality where really there is a fuzzy edge, determined by many conflicting constraints. (Manning 2003: 297)

The concern that introduction of probabilities into linguistic theory will introduce chaos is unfounded, according to Manning (2003). Rather, a probabilistic grammar can be seen to broaden the scope of linguistic inquiry, and to do so in a principled manner. A probabilistic view of grammaticality can thus provide more fine-grained knowledge about language and the different factors which interact.

2.4.2 Categories

Linguistic category membership can also be gradient in the sense that elements are members of a category to various degrees. In general, we find gradience between two categories α and β when their boundaries are blurred. By this we mean that some elements clearly belong to α and some to β, whereas a third group of elements occupies a middle ground between the two. The intermediate category possesses both α-like and β-like properties (Aarts 2004).

In work on descriptive grammar it is often recognized that taxonomic requirements of linguistic categories are problematic; elements do not all neatly fall into a category and some elements have properties of several categories.
For instance, it is well known that providing necessary and sufficient criteria for membership in part-of-speech classes is difficult and a view of these criteria as graded, or weighted, was proposed as early as in Crystal 1967. Prototype theory, following influence from psychology, has been influential in cognitive linguistics (Lakoff and Johnson 1980; Lakoff 1987) and promotes precisely the idea that membership in a category is not absolute, but rather a matter of gradience. Moreover, gradience is defined with reference to a prototypical member of a category.

One response to graded phenomena which maintains a sense of categoricity is the introduction of split categories. For instance, in LFG, phrasal categories may be both functional and lexical in terms of the notion of ‘co-heads’, and HPSG allows for multiple inheritance in type hierarchies.

2.5 Conclusion

The empirical shift mentioned initially is evident in work ranging from theoretical and experimental approaches to computational modeling of natural language. The work described in this thesis adheres to an empiricist methodology, focusing on the essential role of language data in linguistic investigations. Furthermore, we subscribe to a view of language where linguistic structure is determined by a set of, possibly conflicting, constraints. In chapter 3 we examine the linguistic dimensions of argument differentiation, an area which has been proposed to be influenced by constraints on linguistic structure which show up as frequency effects in a range of different languages. The main parts of this thesis will be devoted to the investigation and computational modeling of argument differentiation. In particular, we employ data-driven models taken from computational linguistics, which support a direct relation between frequency of language use and linguistic categories.
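As a rough illustration of what a direct relation between frequency of language use and linguistic categories can amount to, the following minimal sketch estimates the relative frequency of a grammatical function given an animacy value. The observations, the labels and the function name `relative_frequency` are invented for illustration only; they are assumptions and are not drawn from the corpora or models used in this thesis.

```python
from collections import Counter

# Toy (grammatical function, animacy) observations.
# These pairs are invented for illustration; they are not corpus data.
observations = [
    ("SUBJ", "animate"), ("SUBJ", "animate"), ("SUBJ", "animate"),
    ("SUBJ", "inanimate"),
    ("OBJ", "inanimate"), ("OBJ", "inanimate"), ("OBJ", "inanimate"),
    ("OBJ", "animate"),
]


def relative_frequency(pairs, function, animacy):
    """Estimate P(function | animacy) as a simple relative frequency."""
    joint = Counter(pairs)                          # counts of (function, animacy)
    marginal = Counter(a for _, a in pairs)         # counts of each animacy value
    if marginal[animacy] == 0:
        return 0.0                                  # unseen animacy value
    return joint[(function, animacy)] / marginal[animacy]


print(relative_frequency(observations, "SUBJ", "animate"))
```

A soft constraint in this simplified picture is nothing more than such an estimate: a preference whose strength is read off the data rather than stipulated.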
Data-driven models rely on statistical inference over language data, combining different sources of information and can in this respect be seen to express soft, probabilistic constraints. Within the area of syntactic parsing, computational models of incremental parsing may be studied to elucidate properties of constraints further. In chapter 8, we introduce data-driven dependency parsing (Nivre 2006) as an instantiation of such a model. We will study argument disambiguation and investigate the effect of various types of linguistic information. The linguistic features employed in the study of argument disambiguation in chapter 9 are theoretically motivated and furthermore surface-oriented, lexical and non-modular.

The direct relationship in data-driven models between frequency of language use and categories furthermore enables a study of gradience. We will in the following chapters discuss categorial gradience in several places, and in particular with respect to semantic properties, such as animacy and selectional restrictions.

3 LINGUISTIC DIMENSIONS OF ARGUMENT DIFFERENTIATION

This chapter presents argument differentiation and its linguistic dimensions. We start out by briefly introducing the notion of argumenthood and discuss some further distinctions within the category of arguments. The introduction of the term ‘argument differentiation’ is motivated and we go on to discuss several linguistic factors which have been proposed to differentiate between the arguments of a sentence. We discuss the factors independently, as well as their interaction in the context of argument differentiation. This chapter thus introduces terminology which will be employed in the following and provides theoretical motivation for the linguistic properties which will be investigated in Part II and Part III of the thesis.
3.1 Arguments

A distinction between arguments and non-arguments is made in some form or other in all syntactic theories.11 The distinction can be expressed through structural asymmetry or stipulated for theories where grammatical functions are primitives in representation. For instance, in LFG (Kaplan and Bresnan 1982; Bresnan 2001), grammatical functions are primitive concepts and arguments or governable functions (SUBJ, OBJ, OBJθ, OBLθ, COMP, XCOMP) are distinguished from non-arguments or modifiers (ADJ, XADJ). HPSG (Pollard and Sag 1994; Sag, Wasow and Bender 2003) similarly distinguishes the valency features (SPR, COMPS) from modifiers (MOD). In most versions of dependency grammar (see, e.g., Mel’čuk 1988; Hudson 1990), grammatical functions are also primitive notions and not derived through structural position.12

Regardless of notation, the notion of argumenthood is important in syntactic theory and is closely related to the semantic interpretation of a sentence.

11 We adopt the more theory-neutral term of ‘non-argument’, rather than ‘adjunct’, which is closely connected to the structural operation of adjunction.

12 For a brief introduction to dependency grammar, see section 5.1.1.

Dalrymple (2001) cites Dowty (1982) in proposing two tests for argumenthood:

(7) Tests for argumenthood (Dowty 1982):
    (i) Entailment - the existence of an argument is entailed by the predicate
    (ii) Subcategorization - arguments are obligatory, non-arguments are optional

These two tests decompose the notion of an argument, positioning it in the syntax-semantics interface. The entailments of a predicate are closely related to the argument structure of a predicate, which characterizes the core participants, or thematic roles, involved in an event. The subcategorization of a verb relates to the obligatoriness of an argument, hence constrains the syntactic realization of the event.
Neither test, however, provides a sufficient criterion by which to distinguish arguments from non-arguments. The entailment test is not strict enough, for instance allowing for time adverbials to be arguments since all events entail a location in time and space. The subcategorization test, on the other hand, is too strict in excluding, for instance, arguments of verbs like eat, which may function intransitively. As Dalrymple (2001) notes, both of these tests still make some valid predictions: “if a phrase is an argument, it is either obligatorily present or it is entailed by the predicate. If a phrase is a modifier, it can be omitted” (Dalrymple 2001: 12).13 Other tests for the argument/non-argument distinction include iteration and reordering of non-arguments (Sag, Wasow and Bender 2003).

Cross-linguistic generalizations relating to grammatical functions often make reference to a hierarchy, such as the one in (8) below (Keenan and Comrie 1977; Bresnan 2001):14

13 Due to the amount of variation exhibited by different verbs in their subcategorization frames, Manning (2003) proposes a probabilistic view of argumenthood, according to which the exceptions from tests like the ones in (7) simply represent less prototypical, or likely, arguments.

14 The hierarchy in (8) is taken from Bresnan 2001 and differs from the original hierarchy (Keenan and Comrie 1977) in the inventory of object functions. Keenan and Comrie (1977) employ the distinction between direct and indirect objects, and impose the ordering OBJ > IOBJ on these. This distinction takes semantic role to be indicative of function, and groups together the theme object of a ditransitive verb with the object of a monotransitive verb. An alternative distinction is made between primary and secondary objects in typological work on grammatical functions (Dryer 1986) and is also the main distinction made in LFG, where OBJθ is the secondary object.
It is argued that in many languages indexation and case-marking group the indirect object with the monotransitive object (primary objects) and treat these as distinct from the secondary (direct) object in ditransitive constructions. English has been argued to follow both of these, with evidence in the dative alternation illustrated in (10)–(11).

(8) SUBJ > OBJ > OBJθ > OBL > COMP > ADJ

The main idea behind such a hierarchy is that a generalization which applies to an element on the scale will also apply to the elements to the left of it.15 Grammatical hierarchies may also be interpreted as expressing the relative prominence of the ranked elements. In this case, prominence can be defined structurally, but recent proposals have highlighted highly ranked elements as being cognitively accessible, see section 3.6 below.

We distinguish between core and non-core argument functions (Bresnan 2001). Subjects and objects (direct and indirect) are the core functions, whereas various oblique functions, as well as clausal complements, are non-core. Phenomena which differentiate core from non-core arguments are possibilities for verb agreement, anaphoric binding patterns and control (Dalrymple 2001).

There are also reasons to distinguish the subject relation from the other argument relations, as the external argument. The external argument is in theories such as HPSG assigned a relation (SPR), which groups it with determiners. This is a clear feature structure translation of the structural asymmetry expressed in X-bar theory as holding between specifiers and complements. The differentiation of the subject from the other argument functions, however, is not only based on structural assumptions. Subjects exhibit linguistic properties which differentiate them from the other argument functions, such as direct objects. Phenomena which only the subject participates in thus have a “cut-off point” after the first element in the hierarchy.
These phenomena include verb agreement in a range of languages (including English), honorification in Japanese, as well as raising and control phenomena.16

We introduce the term argument differentiation and will in the following employ it as a neutral cover term to denote the process by which arguments are distinguished along one or more linguistic dimensions. The rationale behind the introduction of this term reflects the mediating status of arguments between syntax and semantics. First of all, argument differentiation will be employed as neutral with respect to theory or application, as opposed to terms like ‘interpretation’ or ‘disambiguation’, which are more or less theoretical and applied terms, respectively. We also, as mentioned initially, wish to maintain a non-modular orientation in the following, and argument differentiation reflects this orientation in not taking syntactic or semantic evidence to be primary. This allows us to generalize over mapping from meaning to form, as expressed by ‘realization’, and from form to meaning, known as ‘interpretation’, as well as terminology employed in work on psycholinguistic processing, e.g., ‘production’ and ‘comprehension’. We will employ the above terms when appropriate to make clear the exact application in the specific context.

15 The particular hierarchy in (8) relates to accessibility of grammatical functions for relativization, e.g. if direct objects may be relativized in a language, then it will also be possible to relativize the grammatical subject etc. The original hierarchy also includes genitive modification.

16 It is only the subject of the subordinate clause which may be raised or controlled.

3.2 Animacy

The dimension of animacy roughly distinguishes between entities which are alive and entities which are not; however, other distinctions are also relevant and the animacy dimension is often viewed as a continuum.
Animacy is a grammatical factor in a range of languages and is closely related to argument realization and differentiation. In this section we examine how animacy influences language, with focus on argument differentiation, and we furthermore examine some properties of the category of animacy itself.

The effect of animacy in linguistic phenomena has been noted in several places in the literature and we provide a few examples of this below. For a more detailed overview, see Yamamoto 1999. Typological work on animacy often makes reference to an animacy hierarchy or scale, following Silverstein 1976. An example of an animacy hierarchy, taken from Aissen 2003, is provided in (9):17

(9) Human > Animate > Inanimate

Evidence for this hierarchy comes from cross-linguistic examination of the realization of animacy in different languages, and especially of how animacy motivates morphological and/or functional “splits” in various ways. The scale in (9) generalizes over phenomena which are influenced by the animacy of the referents involved by providing a set of implications following from different cut-off points on the hierarchy. For instance, number marking may be sensitive to animacy in the sense that elements with human or animate reference exhibit number distinctions not possible for elements with inanimate reference, as found in, e.g., Tiwi and Kharia (Yamamoto 1999). The phenomenon known as Differential Object Marking (Aissen 2003; Comrie 1989) provides another example, where the morphological case marking of direct objects may be determined by animacy and where different languages exhibit different cut-off points on the above hierarchy. For instance, in the Dravidian language Malayalam, we find case marking of objects with human or animate reference, whereas objects with inanimate reference are unmarked for case.

17 Comrie (1989) calls the middle category in the hierarchy Animal, whereas Aissen uses the term Animate.
As we shall see, the intermediate category need not be limited to animals, but rather highlights the gradient nature of the animacy dimension, see section 3.2.4.

3.2.1 Animacy of arguments

A recent special issue of the linguistic journal Lingua was dedicated to the topic of animacy and discusses the role of animacy in natural language from rather different perspectives, ranging from theoretical and typological to experimental studies (de Swart, Lamers and Lestrade 2008). These various perspectives all highlight animacy as an influencing factor in argument differentiation. There is a cross-linguistic tendency for external arguments or subjects to be human or animate and for objects to be inanimate (Comrie 1989). de Swart, Lamers and Lestrade (2008) cite examples from languages like Jakaltek, where inanimate subjects are simply ungrammatical, but where human/animate subjects are perfectly grammatical.

We may distinguish between the effect of isolated versus relative animacy, that is, whether the animacy of an isolated element determines an effect or whether it is the animacy of one argument relative to another which creates an effect in a language. For instance, in the Mayan language MamMaya, a transitive sentence is ungrammatical if the object is higher in animacy than the subject, as in The dog sees the woman (de Swart, Lamers and Lestrade 2008). In Navajo, such a construction is clearly avoided and an alternative construction (The woman is seen by the dog) is chosen instead.18 In many languages this tendency is reflected in language data as a frequency effect, even though these types of transitive constructions are perfectly grammatical (Dahl and Fraurud 1996; Øvrelid 2004). Following a corpus study of animacy in Swedish, Dahl and Fraurud (1996) conclude that:

[M]ore than 97% of all transitive sentences obey the constraint that the subject should not be lower than the object in animacy.
Thus, this constraint, which is grammaticalized in a language such as Navajo, could be said to be approximated statistically in Swedish texts. (Dahl and Fraurud 1996: 53)

Animacy furthermore has an effect on the differentiation of core and non-core arguments. Bresnan et al. (2005), for instance, argue that animacy is an important factor in the so-called dative alternation in English, clearly influencing the choice between the double object construction in (10) and the prepositional dative structure in (11):

(10) . . . gave the prime minister a pen
(11) . . . gave a pen to the prime minister

18 The inverse construction in Navajo can be paraphrased by our passive construction and is expressed by the verbal affix bi and employed when the subject is lower in animacy than the object (Dahl and Fraurud 1996).

3.2.2 Ambiguity resolution

The influence of animacy in both language production and comprehension has been widely investigated in psycholinguistic studies. By manipulating the animacy of elements in otherwise controlled environments, the effect of animacy on syntactic structure may be studied experimentally.

Animacy effects have played an important role in the debate in psycholinguistic theory between two different views of language processing and the modeling thereof – a serial, modular or “syntax-first” model versus a single-stage model, where different kinds of information from different linguistic levels, such as syntax, semantics and pragmatics, interact. In particular, the fact that animacy, being a lexical and semantic property, influences syntactic interpretation has been taken as evidence for the latter type of model.
Abandonment of modular processing models has characterized psycholinguistic work in the last decade, and the use of new types of processing evidence19 has enabled even more detailed results on the use of various information sources during language processing (Sag and Wasow 2008).

In comprehension studies, animacy has been shown to have a clear effect on the resolution of grammatical function ambiguities. The tendency for animate elements to be syntactically prominent, as discussed in the preceding section, is shown to provide an important information source in disambiguation. Weckerly and Kutas (1999) report results from ERP experiments on the comprehension of English object relatives20 and argue that a probabilistic constraint-based, interactional model is most appropriate for modeling the influence of animacy on choice of syntactic structure. They find an early effect of animacy, independently of the verb, which expresses the clear correlation between animacy and syntactic function assignment. Inconsistencies between syntactic and semantic information, following the cross-linguistic tendencies outlined in the previous section, result in clear experimental effects. Mak, Vonk and Schriefers (2006) come to similar conclusions after studying the effect of animacy on the processing of Dutch relative clauses using a self-paced reading task.

19 Event-Related brain Potentials (ERP) and eye-movement tracking are examples of experimental methods which allow for on-line and more precise measures of processing activity, through electrical and muscular activity, respectively.

20 Object relative constructions are tested with differing animacy of the extracted object, as well as the subject of the relative clause, as in the poetry that the editor recognized . . . vs. the editor that the poetry baffled . . . (Weckerly and Kutas 1999).

They find that the relative animacy of the entities in question is most
important and can even counteract the usual preference for subject-relatives which has a strong influence on processing.21

3.2.3 The nature of animacy effects

There are languages where animacy creates hard, categorical effects on the realization of arguments. These hard effects are found in particular in the encoding of core arguments through morphology, as in the phenomenon of Differential Object Marking mentioned above. In most theoretical and experimental work where animacy figures, however, it is claimed to be a soft, probabilistic constraint on structure, with evidence in frequency effects either in corpus data or in experimental results. Animacy is then argued to influence the choice of syntactic structure and the realization or interpretation of syntactic arguments.

Theoretical studies have examined the influence of animacy in a range of syntactic constructions in various languages. The focus has in particular been on various grammatical alternations, such as the active-passive alternation (Bresnan, Dingare and Manning 2001), the dative alternation (Bresnan and Nikitina 2007; Bresnan et al. 2005; Bresnan 2006) and the genitive alternation (Rosenbach 2003, 2005, 2008). Here, the choice of construction is seen as depending on several factors and a central theme in this work is investigating the effect of these factors by assessing their predictive strength and teasing them apart with reference to the particular construction under scrutiny.

Experimental studies involving animacy share with the theoretical work discussed above the assumption that animacy influences the choice of syntactic structure, and, in particular, the differentiation of arguments.
Whether from the perspective of language production, where the outcome is directly observable, or from that of comprehension where the chosen analysis shows up as a response to language data, the probabilistic effect of animacy has been reported in numerous studies (Branigan, Pickering and Tanaka 2008; Weckerly and Kutas 1999; Mak, Vonk and Schriefers 2006).

21 Dutch subject and object relative clauses do not differ in word order, e.g. de wandelaars, die de rots beklommen hebben ‘the hikers, who climbed the rock’ versus de rots, die de wandelaars beklommen hebben ‘the rock, that the hikers climbed’ (Mak, Vonk and Schriefers 2006). This means that in processing these, the choice of analysis as subject or object relative has to be made, unlike English where these are structurally unambiguous.

As we have seen, animacy has a hard effect on the realization of arguments in many languages and it is also evident in distributional tendencies in a range of languages. One possibility which has been explored in recent studies is that the hard and soft effects of animacy are instances of the same constraints on language and that “soft constraints mirror hard constraints” by simply having varying strength in different languages (Bresnan, Dingare and Manning 2001). For instance, a language like Lummi has grammaticalized the preference for local (1st/2nd person) subjects, where a passive construction with a demoted local subject is ungrammatical. In English, the avoidance of this type of construction, as in The car was bought by me, constitutes a strong statistical tendency (Bresnan, Dingare and Manning 2001). With respect to animacy, the constraint on relative animacy grammaticalized in, for instance, Jakaltek and MamMaya is observed as a statistical tendency in corpus data (Dahl and Fraurud 1996; Øvrelid 2004).
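To make the notion of a statistically approximated constraint concrete, the sketch below checks the relative animacy constraint (the subject should not be lower than the object in animacy) against a set of transitive clauses and reports the share that obey it. The ranking follows the hierarchy in (9); the clause list, the numeric ranks and the function names are assumptions made purely for illustration and do not reproduce the data of Dahl and Fraurud (1996).

```python
# Animacy scale from (9): Human > Animate > Inanimate.
# The numeric ranks are an illustrative encoding (higher = more animate).
ANIMACY_RANK = {"human": 2, "animate": 1, "inanimate": 0}


def obeys_relative_animacy(subject, obj):
    """True if the subject is not lower than the object in animacy."""
    return ANIMACY_RANK[subject] >= ANIMACY_RANK[obj]


def proportion_obeying(clauses):
    """Share of (subject animacy, object animacy) pairs obeying the constraint."""
    if not clauses:
        raise ValueError("no clauses given")
    obeying = sum(obeys_relative_animacy(s, o) for s, o in clauses)
    return obeying / len(clauses)


# Toy transitive clauses as (subject animacy, object animacy) pairs.
# Invented for illustration; not corpus data.
toy_clauses = [
    ("human", "inanimate"),
    ("human", "human"),
    ("animate", "inanimate"),
    ("inanimate", "human"),  # violates the constraint
]

print(proportion_obeying(toy_clauses))
```

In a language where the constraint is grammaticalized, the proportion would be 1.0 by definition; in a language where it is soft, the proportion is simply high, which is the sense in which a soft constraint can be said to mirror a hard one.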
Even if categorical animacy effects in language are rare, it is clear that probabilistic effects of animacy have been observed in numerous languages and that frequency data from language use can be employed as linguistic evidence for such a claim. A view of linguistic structure as determined by a set of interacting, soft constraints captures these observations in a comprehensive model.

3.2.4 Gradient animacy

The animacy hierarchy presented in (9) above consists of three categories – human, animate and inanimate. However, these are by no means static and given a priori. We are interested in animacy first and foremost as a linguistic category and how it is reflected in language. The hierarchy therefore reflects the categories which are deemed relevant in linguistic phenomena which are sensitive to the dimension of animacy. We shall see later on that animacy interacts with several other linguistic dimensions in argument differentiation. However, it can be useful to be able to separate out animacy as an independent property which is inherent in nouns. In this respect, we make a distinction between denotational as opposed to referential properties.

Denotational properties hold for lexemes and are context-independent, i.e., independent of the particular linguistic context, whereas referential properties are determined in context and hold for referring expressions, rather than lexemes (Lyons 1977). These terms are clearly related and are often used interchangeably. Denotational properties are important in reference, and what is referred to in a given context is always within the denotation of at least one lexeme (Lyons 1977). For instance, ‘doctor’ contributes a bulk of semantic information in a referring expression such as ‘her doctor’. Nevertheless, this distinction is useful in discussing gradience within the category of animacy. We thus distinguish between animacy as a denotational property of lexemes, e.g.
it is in the denotation of ‘doctor’ that it refers to a human, and animacy as a referential property of referring expressions, e.g., that ‘her doctor’ refers to a particular human being in a particular context. In particular, we shall see that these need not always coincide.

In section 2.4, we discussed gradience of linguistic categories. We find that the animacy dimension exhibits gradience cross-linguistically, as well as within languages. For linguistic phenomena which are sensitive to animacy, there is a certain degree of variation between languages that seems to be culturally determined. For instance, Persian has been cited to treat trees linguistically as animates (Rosenbach 2002) and number marking in the Papuan language Manam is sensitive to the categories of human and ‘higher’ animals (Croft 2003). It is suggested then that the animacy hierarchy is “not an ordering of discrete categories, but rather a more or less continuous category ranging from most animate to least animate” (Croft 2003: 130). Collective nouns, like committee and family, pose some interesting problems for semantic theories, due to their dual nature in denoting both a group and a collection of individuals. This duality is reflected in the possibility for collective and distributive predication, respectively. With respect to animacy, collective nouns are candidates for an intermediate animacy status, something which is reflected in annotation schemes for animacy, see section 7.1.1. They also vary in their status cross-linguistically, and Yamamoto (1999) finds that Japanese tends to treat collectives more like inanimates than English.22

The fact that a referring expression may be employed to refer to entities of varying animacy should be kept separate from the gradience of denotation discussed above.
With respect to various processes of referential shifts, such as metaphor and metonymy, the context may override denotational properties in determining reference for an expression:

(12) The ham sandwich is sitting at table 20

The classic example in (12), taken from Nunberg 1979, employs metonymy in that it uses “one entity to refer to another that is related to it” (Lakoff and Johnson 1980: 35). It is clear, however, that the inherent, denotational animacy of the noun ham sandwich is not gradient, and it is exactly this property which makes non-literal language possible, since metonymy often involves a violation of the semantic selectional restrictions of the verb (Fass 1988).

22 Yamamoto (1999) examines an English-Japanese parallel corpus and performs a contrastive corpus study of the use of referring expressions in the two languages.

The example in (12) is an instance of creative metonymy. However, metonymy is also a regular process with a set of conventionalized patterns which recur in language (Lakoff and Johnson 1980). In fact, a corpus study of metonymy in English found that 20% of country names and 30% of organization names were employed metonymically (Markert and Nissim 2006). Commonly occurring patterns are ‘place-for-people’, as in (13), and ‘organization-for-members’, as in (14) below (Markert and Nissim 2006):

(13) America did once try to ban alcohol
(14) Last February NASA announced [. . . ]

Note, however, that proper names are not traditionally assumed to have a denotation separate from their reference, and neither are pronouns (Lyons 1977). In the parlance of truth-conditional semantics, referring expressions do not predicate of their referent. However, processes of metonymy seem to presuppose a denotation of some kind.
We furthermore find these types of regular metonymical extensions for common nouns:23

(15) Kyrkan      menar   att   båda  dessa  riktningar  var   positiva  tillgångar
     church-DEF  thinks  that  both  these  directions  were  positive  assets
     ‘The church feels that both of these directions were positive assets’

With respect to gradience of animacy, we may ask whether regular metonymic patterns of this type influence the semantics of nouns to such an extent that this is rather a matter of polysemy than referential shifting in a particular context. It is interesting to note that these nouns, as well as the proper nouns, have a collective meaning, a category with gradient animacy properties, as noted above.

In computational semantics, the distributional hypothesis represents a common assumption about meaning which proposes that words with a similar distribution also have similar meanings. On this hypothesis, the denotation of a linguistic element is reduced to the set of possible contexts in which the element may occur. Chapters 6 and 7 deal with the acquisition of animacy information for nouns based on linguistic distribution. In particular, we will see that a denotational treatment of the category of animacy enables an investigation into the gradience of the animacy dimension which abstracts over individual linguistic contexts.

23 The example in (15) is taken from the Swedish treebank Talbanken05. See section 5.1 for a general overview of corpora employed in this thesis and section 5.1.1 for more detail on the Talbanken05 corpus in particular.

3.3 Definiteness

Central to a notion of definiteness is the property of identifiability, see, e.g., the discussion in Lyons 1999. A referent is identifiable if the hearer is familiar with the referent, or, based on the situation of the utterance, the previous discourse or general background knowledge, the hearer is able to work out the referent of the noun. Another often mentioned characteristic of definiteness is that it involves an implication of uniqueness, i.e. that the referent is in some sense unique in a certain context (Lyons 1999):

(16) I’ve just been to a wedding. The bride wore blue.

Clearly, this is not a matter of the hearer identifying the referent of the definite noun phrase, but rather acknowledging that there is usually only one bride at a wedding.

The following grammatical hierarchy for definiteness is presented in Croft (2003: 132):

(17) Definite > Specific Indefinite > Non-specific Indefinite

Specificity has been a widely studied subject in formal semantics, see von Heusinger 2002 for an overview and references therein. With respect to the earlier mentioned criterion of identifiability, the difference between the three categories in the hierarchy can be schematized as in figure 1. Under this criterion, the specificity distinction for indefinites is linked to the speaker having a more precise conception of the referent.

identified by   definite specific   indefinite specific   indefinite non-specific
speaker         +                   +                     −
hearer          +                   −                     −

Figure 1: The ‘identifiability’ criterion for definiteness and specificity (von Heusinger 2002: 249)

The notion of identifiability has, however, been debated as a criterion for definiteness. Different approaches have provided a more discourse-oriented view of definiteness, highlighting properties such as familiarity and salience, which underline the role of the discourse context in definiteness and how the degree of definiteness may be equated with cognitive status (Gundel, Hedberg and Zacharski 1993).
A hierarchy focusing on these discourse-oriented properties is represented by the ‘givenness hierarchy’ in (18) (Gundel, Hedberg and Zacharski 1993: 275):

(18) in focus > activated > familiar > uniquely identifiable > referential > type identifiable

As we shall see in section 3.4, languages exhibit highly conventionalized ways of referring, depending on definiteness, and cognitive status affects the formal realization of referring expressions in a systematic manner.

We distinguish between semantic and formal definiteness. Semantic definiteness can be marked formally in a language in different ways, for instance through morphological marking, but the two are not necessarily isomorphic. In the following we will discuss notions of semantic definiteness and the way definiteness interacts with argumenthood.24

3.3.1 Definite arguments

Definiteness is not as commonly recognized as a factor in argument differentiation as animacy. A tendency towards definite subjects has, however, been noted for several languages, both as a categorical constraint influencing morphological marking and as a statistical tendency. Common to these is the same generalization, namely a tendency for subjects to be definite or specific and for objects to be indefinite. In Turkish and Persian, we find Differential Object Marking which is sensitive to definiteness and where definite objects are marked with accusative case, but indefinite objects are not (Croft 2003). A range of languages have been noted to categorically exclude or strongly disprefer non-specific indefinite subjects (Aissen 2003).

There are clear correlations between information-flow in a sentence and argumenthood; subjects tend to represent old information and objects tend to introduce new information. Since subjects tend to precede objects as well, it can be difficult to establish this influence independent of ordering.
Weber and Müller (2004) present a corpus study of word order variation in German main clauses, indicating that formal definiteness correlates with grammatical function to a greater extent than with linear order, and furthermore that formal definiteness and givenness of information tend to coincide. A discourse-oriented definition of definiteness is also employed in the aforementioned study of the dative alternation in English, where givenness is shown to be a factor in the choice between core and non-core argument realization (Bresnan et al. 2005), cf. examples (10) and (11).

The avoidance of an indefinite subject has been argued to constitute one factor in the choice of existential or presentational constructions in the Scandinavian languages (Sveen 1996; Mikkelsen 2002), illustrated by a Swedish example in (19) and a Norwegian example in (20):25

(19) Det  finns   olika      slags  barnhem
     it   exists  different  sorts  orphanages
     ‘There are different kinds of orphanages’

(20) Det  oppsto    brudd  mellom   stoffet        og   tankveggen
     it   occurred  break  between  substance-DEF  and  tank-wall-DEF
     ‘A break occurred between the substance and the wall of the tank’

The presentational construction contains an expletive subject and a postverbal, logical subject occurring in object position. The object position in these constructions may only be occupied by an indefinite argument.26

24 See chapter 4 for more on formal marking of definiteness in Scandinavian.
25 The Swedish example in (19) is taken from the Talbanken05 treebank and the Norwegian example in (20) is taken from the Oslo Corpus. See chapter 5 for descriptions of these corpora.
26 See section 8.3.1 and examples for more on the distribution of formal subjects in Swedish.

3.4 Referentiality

The difference between the denotation and reference of an expression was discussed above and we may furthermore note that there are (at least) three related senses of ‘referentiality’ in the literature:

1. level of context-dependence
2. specificity
3. meaningfulness

First, referentiality may be employed to make the distinction between referring expressions and elements which are not referring in the sense that they rely only on denotational properties for semantic interpretation. A grammatical hierarchy of referentiality expressing its influence in various linguistic phenomena is presented in (21) (Croft 2003: 130):27

(21) pronoun > proper name > common noun

This sense of referentiality, then, relates to the extent to which semantic interpretation requires access to the context of the utterance. This is related to the expression of definiteness, or level of cognitive status, as discussed in section 3.3 above. Pronouns have to be resolved by the context, proper nouns rely on a conventional mapping to a referent, whereas the interpretation of common nouns relies the least on context and more on denotation. Sense 2 (Givón 1984) distinguishes between referential and non-referential indefinites, largely synonymous with the distinction of specificity discussed above. Senses 1 and 2 are thus related in that they focus on ‘ways of referring’ to an entity.

27 Croft (2003) provides several examples of linguistic phenomena which support the hierarchy in (21). Number marking is often sensitive to this dimension; for instance, pronouns in Usan distinguish number, whereas common nouns do not (Croft 2003: 128).

The term ‘non-referential’ is also employed in the sense ‘semantically empty or null’ (Sag, Wasow and Bender 2003) and with particular reference to the distinction between referential pronouns, as in (22), and non-referential, or expletive, pronouns, as in (23) below:

(22) It bothered us for days
(23) It is hard to sleep

In the following we will employ the term ‘referentiality’ in sense 1 above, expressing the degree of context-dependence of an expression.
Whereas we recognize, in line with von Heusinger (2002), that specificity differs from definiteness, at least in a discourse-related sense, we will not delve further into a discussion of how to make this finer subdivision. We will furthermore make explicit when referentiality in sense 3 is employed by specifying the application of the term, e.g., ‘non-referential it’.

3.4.1 Referentiality and arguments

Syntactic arguments differ with respect to their referentiality. As mentioned in section 3.3, the definiteness or cognitive status of an element influences its referentiality. In particular, subjects are likely to be pronominal and objects are more likely to express a lower referentiality (Keenan 1976). The category of pronouns may be further subdivided along the dimension of person, which distinguishes reference to the speaker and hearer (i.e. discourse participants) from others (Croft 2003: 130):

(24) 1st/2nd (local) person > 3rd person

We find that subjects cross-linguistically tend to be expressed by a local person, and more so than objects. This tendency has been attributed to the ‘egocentricity’ of human discourse (Dahl 2000) or, less cynically perhaps, to notions of empathy (Kuno and Kaburaki 1977); we tend to speak about ourselves, a fact which is reflected in our choice of referring expressions. The active/passive alternation has been shown to be influenced by person, where passive voice is strongly dispreferred, or even ungrammatical, when the subject is local (Bresnan, Dingare and Manning 2001). Furthermore, the dative alternation is influenced by referentiality and in particular by the distinction between pronominal and non-pronominal expression (Bresnan et al. 2005).

3.5 Relational properties

The above sections have focused on properties of arguments of verbs which are inherent to the arguments. However, an argument is defined as such because it stands in a syntactic relation to a particular predicate.
Relational properties are properties which are not inherent to the argument itself, but rather describe facets of the relation of an argument to its predicate.

Semantic roles express the semantic relation between an argument and a particular predicate, often expressed as a lexical property of the predicate. There is a consistent relationship between thematic roles and syntactic functions, expressed in syntactic theory as a theory of mapping, e.g., Linking Theory (Baker 1997) and Lexical Mapping Theory in LFG (Bresnan and Kanerva 1989). These mapping theories express a direct relationship between prominence of thematic role and syntactic function.28 In particular, we find that the most prominent thematic role is mapped to the subject function, the external argument. Central to a mapping of thematic roles to syntactic functions is therefore often a hierarchy of thematic roles, like the one in (25) below (Bresnan 2001: 307):29

(25) Agent > Benefactive > Experiencer > Instrument > Theme > Location

The semantic restrictions posed by the verb on its arguments are called selectional restrictions.30 The idea that verbs select for arguments of a specific semantic category has been explored in semantic theories (Katz and Fodor 1963). In lexicalist theories, this notion has been somewhat more developed and rests on the idea that the verb is the head of the clause and is specified lexically for certain selectional restrictions. Animacy has figured in the expression of selectional restrictions from the very beginning. Chomsky (1965) operates with the categories of [±Animate] and [±Abstract] in selectional restrictions for verbs and Katz and Fodor (1963) distinguish the categories of ‘Human’ and ‘Higher Animal’ explicitly, as well as their hypernym ‘Physical Object’.

28 Syntactic prominence in theories of semantic role mapping may be expressed as phrase-structural prominence, i.e. c-command, or with reference to a hierarchy of syntactic functions, such as the one in (8) above.
29 Aissen (1999) also makes use of a hierarchy of thematic roles, albeit a simple one, inspired by Dowty (1991): Agent > Patient.
30 There is variation in the terminology employed in the literature to refer to selectional restrictions. Other terms include ‘selectional constraints’ (Resnik 1996) and ‘selectional preferences’ (Erk 2007). We employ the term ‘selectional restrictions’ in the following as we take it to be more neutral with respect to formulation (absolute vs. gradient) and application (theoretical vs. computational/applied).

In computational work, automatic acquisition of selectional restrictions, mainly from corpus data, has been further investigated. Resnik (1996) first proposed an approach to acquisition of selectional restrictions with class information taken from the English WordNet (Fellbaum 1998). It is based on an information-theoretic approach to verbal argument selection and quantifies the extent to which a predicate constrains the semantic class of its arguments as its selectional preference strength.31 The approach highlights the gradient nature of the selectional restrictions posed by a verb on its arguments, rather than providing absolute constraints.

3.6 Interaction and generalization

The above sections have presented the various linguistic properties independently. However, it is clear that they interact, in particular with respect to argument differentiation. Various proposals in the literature have also attempted to reduce these factors to one general, more or less explanatory principle. We examine both the interaction and generalization over these linguistic properties below.
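As a hedged illustration of the information-theoretic idea behind selectional preference strength (section 3.5), the following Python sketch computes the relative entropy between the distribution of argument classes given a predicate and the prior distribution of classes, which is how Resnik (1996) defines the measure. The class inventory, the counts and the function names are invented for illustration only; they are not taken from WordNet or from any corpus used in this thesis.

```python
import math


def normalize(counts):
    """Turn raw class counts into a probability distribution."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def preference_strength(given_pred, prior):
    """Relative entropy (in bits) between P(class | predicate) and P(class)."""
    return sum(
        p * math.log2(p / prior[cls])
        for cls, p in given_pred.items()
        if p > 0
    )


# Hypothetical counts of argument classes, for illustration only.
prior = normalize({"human": 40, "food": 20, "artifact": 40})
eat_objects = normalize({"human": 1, "food": 9})
see_objects = normalize({"human": 4, "food": 2, "artifact": 4})

# A predicate that concentrates on one class scores well above zero ...
strong = preference_strength(eat_objects, prior)
# ... whereas one whose arguments mirror the prior scores zero.
weak = preference_strength(see_objects, prior)
```

A value of zero means that the predicate tells us nothing about the class of its argument; larger values indicate a stronger, but still gradient, preference rather than an absolute restriction.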
31 See section 9.2.8 for more details on the acquisition of selectional restrictions.

3.6.1 Interaction

In the above sections, we have observed very similar patterns for all of the three factors of animacy, definiteness and referentiality, showing differential tendencies with respect to several distinctions in argumenthood. The original animacy hierarchy proposed in Silverstein 1976 included not only information on animacy, but also on referentiality:

(26) 1/2 person pronoun > 3rd person pronoun > proper names > human common noun > animate common noun > inanimate common noun

The above hierarchy combines the factors of animacy and referentiality and provides generalizations over a range of linguistic phenomena, such as number marking (Croft 1990). The fact that this hierarchy has been called an animacy hierarchy is somewhat misleading, however; the referents of pronouns are not more animate than the referents of human nouns. However, it is clear that a high level of referentiality often co-occurs with human reference and that this tendency is clearly observable in language use.

Aissen (2003) makes use of a more fine-grained scale for definiteness, incorporating information regarding definiteness and referentiality, as well as a notion of specificity, to account for Differential Object Marking in languages such as Turkish and Hebrew:

(27) Personal Pronoun > Proper Noun > Definite NP > Indefinite Specific NP > Indefinite Non-Specific NP

Object marking in these two languages exhibits differing cut-off points on the hierarchy; in Turkish object marking distinguishes definites from indefinites, whereas in Hebrew the same distinction is made with reference to specificity.

It is clear that these linguistic factors interact and are interdependent in a way that makes their effect difficult to reduce to a single, well-behaved hierarchy.
Rather, these properties yield clusters of properties which tend to go together, exemplified by (28):

(28) Linguistic properties that tend to go together (Dahl 2008: 142)

     Animate                 Inanimate
     Definite                Indefinite
     Pronominal              Lexical
     Subject                 Non-subject
     Count                   Mass
     Proper                  Common
     Rigid designation       Non-rigid designation
     Independent reference   Dependent reference
     Proximate               Obviative
     Agent                   Non-agent

The generalizations embodied in the different grammatical hierarchies presented above, as well as in the list of properties in (28), have been modeled as constraints in recent work in OT, which reinterprets the grammatical hierarchies as expressing the relative prominence of an element. Prominent elements on one hierarchy tend to attract prominent elements of another, hence subjects will tend to be animate, definite, agentive etc. This tendency has been modeled formally employing the technique of harmonic alignment of constraint hierarchies (Aissen 1999, 2003).32 The generalizations mentioned in the preceding sections regarding properties of the subject, for instance, are thought to be a result of constraints which express the relative markedness of combinations of properties, such as the constraints on the animacy of subjects in (29):33

(29) SUBJ/HUM > SUBJ/ANIM > SUBJ/INAN

The main idea is that these constraints interact with other constraints on argument expression in various languages, for instance constraints on morphological marking or word order, but that their internal ranking is fixed. This predicts that inanimate subjects will cross-linguistically be more marked than animate subjects.

As mentioned in section 2.1.1, the relation between typological markedness and frequency in language is obvious and has even led to proposals reducing this notion of markedness to frequency effects (Haspelmath 2006). As a consequence, the modeling of various factors in argument differentiation may employ constraints which are grounded directly in acquired frequency effects, such as the BIAS-constraint, first suggested by Zeevat and Jäger (2002) and employed by Jäger (2004) and de Swart (2007), among others:

BIAS: prefer the normal reading, the reading that is available in most cases

The normal reading is thus the reading that is most likely according to the statistical tendencies described in (28).

32 Harmonic alignment aligns the dominant elements of a scale with the dominant elements of another and the lower ranked elements of one scale with the lower ranked of another, expressing the idea that prominence on one scale will attract prominence on another. Markedness constraints are derived by reversing the output scales from the alignment and adding the ‘avoid’-marker, ‘*’, to them:

   Suppose given a binary dimension D1 with a scale X > Y on its elements {X, Y}, and another dimension D2 with a scale a > b . . . > z on its elements. The harmonic alignment of D1 and D2 is the pair of harmony scales:

   Hx: X/a > X/b > . . . > X/z
   Hy: Y/z > . . . > Y/b > Y/a

   The constraint alignment is the pair of constraint hierarchies:

   Cx: *X/z ≫ . . . ≫ *X/b ≫ *X/a
   Cy: *Y/a ≫ *Y/b ≫ . . . ≫ *Y/z

   (Prince and Smolensky 1993, as quoted in Aissen (2003: 441))

33 For illustrative purposes, we show only the harmonically aligned scales in (29), rather than the markedness constraints resulting from the reversal and negation of these scales. See Aissen 2003 for details.

3.6.2 A more general property

As mentioned above, the interaction of the various linguistic properties in argument differentiation has been attributed to their prominence in individual grammatical hierarchies.34 It is not immediately clear, however, what this notion of prominence actually entails. Without a clear definition of prominence, generalizations which make reference to this are at risk of simply restating the question. There have, however, been several proposals attempting to reduce the effects observed in conjunction with argument differentiation to a more general linguistic property or even properties of our cognitive abilities or language processing in general.

34 Formally, the interaction is modeled as the cross product of the constraint subhierarchies (Aissen 2003).

3.6.2.1 Individuation

The hierarchy in (26) above combines the categories of definiteness and referentiality, highlighting different aspects of linguistic reference, and may be explained by appealing to a notion of individuation: “the degree to which the interpretation of an NP involves a conception of an individuated entity” (Fraurud 1996). Fraurud suggests that our cognitive ontology distinguishes between:

• Individuals, e.g., Gabriel
• Functionals, e.g., the postman
• Instances, e.g., a glass of wine

This ontology is at a more general level than the factors reviewed above but influences the choice of NP form. In particular, animate entities tend to be perceived as Individuals and inanimate as Instances. Individuals are the most individuated and are typically named, whereas Functionals on the other hand are conceived of in relation to some other entity in a part-whole relation and are typically definite. Instances, finally, are instantiations of general types and are thus the least individuated and they are typically indefinite ‘type descriptions’.

Fraurud distinguishes between two main types of knowledge which are necessary for the interpretation of NPs – type knowledge and token knowledge.
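Returning briefly to the harmonic alignment operation quoted in footnote 32 of section 3.6.1, the definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the label OBJ for the lower element of the binary scale is assumed here for the sake of the example, and nothing in the sketch models how the resulting constraints interact with other constraints in an OT grammar.

```python
def harmonic_alignment(binary_scale, scale):
    """Return the harmony scales (Hx, Hy), most harmonic combination first."""
    x, y = binary_scale
    # X combines best with the top of the second scale ...
    hx = [f"{x}/{e}" for e in scale]
    # ... whereas Y combines best with its bottom.
    hy = [f"{y}/{e}" for e in reversed(scale)]
    return hx, hy


def constraint_alignment(binary_scale, scale):
    """Return the constraint hierarchies (Cx, Cy), highest-ranked first."""
    hx, hy = harmonic_alignment(binary_scale, scale)
    # Reverse each harmony scale and prefix the 'avoid' marker.
    cx = ["*" + pair for pair in reversed(hx)]
    cy = ["*" + pair for pair in reversed(hy)]
    return cx, cy


functions = ("SUBJ", "OBJ")        # binary scale X > Y (OBJ label assumed)
animacy = ["HUM", "ANIM", "INAN"]  # scale a > b > ... > z

hx, hy = harmonic_alignment(functions, animacy)
cx, cy = constraint_alignment(functions, animacy)
print(" > ".join(hx))   # SUBJ/HUM > SUBJ/ANIM > SUBJ/INAN
print(" >> ".join(cx))  # *SUBJ/INAN >> *SUBJ/ANIM >> *SUBJ/HUM
```

The first printed scale corresponds to (29); the second is the markedness subhierarchy obtained by reversal, in which the constraint against inanimate subjects is ranked highest.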
Token knowledge is contextually determined and requires previous knowledge about the referent of an expression, whereas type knowledge relies on lexico-encyclopedic knowledge. Token knowledge is relevant only for the identification of Individuals, whereas for the other two types of ontological categories, Functionals and Instances, type knowledge is sufficient. This relates to our earlier discussion on animacy and nouns in section 3.2.4. Nouns typically express Functionals or Instances which rely on type knowledge for interpretation, which is clearly related to the notion of a denotational property.

3.6.2.2 Accessibility

In experimental work on language production, cognitive status has been proposed as an explanation for the tendencies observed above. Branigan, Pickering and Tanaka (2008) appeal to a notion of conceptual accessibility: “[T]he ease with which the mental representation of some potential referent can be activated in or retrieved from memory” (Bock and Warren 1985, as cited in Branigan, Pickering and Tanaka 2008), and distinguish between inherent and derived accessibility. Inherent accessibility is invariant across contexts and is a direct consequence of the number of conceptual relations an entity may partake in, also known as its predictability. Animate entities are assumed to be more predictable than inanimate ones, and hence to have a high inherent accessibility.35 The derived accessibility of an entity is temporary and context-specific, influenced by factors such as semantic priming and discourse status (Prat-Sala and Branigan 2000). Effects of animacy, both on word order and argument differentiation, are then explained with reference to conceptual accessibility; an animate entity is inherently accessible, and often also derived accessible by being definite and referential.
Following from this generalization, the factors influencing argument differentiation can be seen to amount to conceptual accessibility. Conceptually accessible entities are thought to be retrieved first and assigned syntactic function first. It is assumed that syntactic functions are assigned incrementally following a hierarchy of grammatical functions, like the one in (8) above, a fact which accounts for animacy effects found on word order, as well as syntactic function assignment.

35 The noted gradience of the animacy dimension is assumed to stem from gradience of predictability – some animate entities, e.g. humans, are more predictable than others, e.g. jellyfish (Branigan, Pickering and Tanaka 2008).

3.6.2.3 Agentivity

In our earlier discussion on semantic roles in section 3.5, we noted that subjects often stand in an agentive relation to their predicate and animacy and agentivity are therefore strongly related. An important property of agents is their control over and sentience of an event (Dowty 1991). It has been argued that agentivity presupposes animacy (Hundt 2004). This depends somewhat on the notion of agentivity employed, however, and in particular on the treatment of causation. It is well known that many languages can have inanimate natural force subjects, e.g. the storm broke the window, and theories of thematic roles differ in the inclusion of these as agents proper.36

36 In Fillmore’s case grammar (Fillmore 1968), for instance, inanimate causers are treated as agents. Yamamoto (1999), in contrast, operates with a notion of agenthood which does not include causers, following Dik (1989), and which is therefore more closely connected to animacy.

It is clear that there is no isomorphism between agentivity and animacy in general. As we have seen, animacy is an independent factor in a range of linguistic phenomena, where it seems unlikely that agentivity is the determining factor, e.g.
number and case marking of objects. However, with respect to the strong correlation between animacy and subjects, agentivity is clearly an important factor. Semantic roles are relational categories, just like syntactic functions, and the two are hence closely related, but not overlapping. In this sense, explanation by means of agentivity adds a more semantic dimension to the generalization, but does not explain the correlations with definiteness and referentiality any further than the level of syntactic functions itself. Moreover, animacy provides a surface-oriented, lexical constraint on syntactic function assignment which embodies the semantic dimension of agentivity. This will prove to be important in the following.

4 PROPERTIES OF SCANDINAVIAN MORPHOSYNTAX

Languages differ in the way they encode grammatical functions. It has been noted that “morphology competes with syntax” (Bresnan 2001: 6) in that there is largely an inverse relationship between the extent of morphological marking and the degree of word order variation. Languages which encode arguments largely through morphological marking exhibit freer word orders, whereas languages which primarily employ structural positions to encode grammatical functions, so-called configurational languages, necessarily exhibit limited word order variation. Most languages, however, are somewhere in between these two extremes with respect to the balance between morphology and syntax in argument encoding. For instance, the Scandinavian languages have limited morphological marking of syntactic functions, but allow for variation in word order, which makes for an interesting comparison with more configurational languages, like English.

In this chapter we examine some relevant characteristics of the Scandinavian languages. Particular focus will be on various properties of syntactic arguments and on variation in their categorial, morphological and structural encoding.
For more detailed overviews of the Scandinavian languages see, e.g., the Norwegian reference grammar (Faarlund, Lie and Vannebo 1997), the Swedish reference grammar (Teleman, Hellberg and Andersson 1999) or an English overview in Holmes and Hinchliffe 2003.

4.1 Morphological marking

The distinction between various types of arguments is partially encoded through case marking in Scandinavian. Nominal arguments are furthermore inflected for other categories, such as definiteness.

4.1.1 Case

Morphological case explicitly encodes grammatical function. The Scandinavian languages make limited use of case marking, and, in this respect, resemble English. Pronouns are marked for case, but exhibit syncretism and syntactic variation, whereas nouns distinguish only genitive case and are otherwise invariant for case.

4.1.1.1 Core arguments

Personal pronouns distinguish nominative, accusative and genitive case.37 In the Swedish examples in (30)–(32), we see that case unambiguously signals syntactic function for the first person pronoun jag/mig ‘I/me’. It is a subject in (30) and (31) and a direct object in (32). Assignment of syntactic function takes place irrespective of the position of this argument, which is preverbal in the case of (30) and postverbal in (31) and (32).

(30) Jag    såg  den
     I-NOM  saw  it-Ø
     SUBJ        OBJ
     ‘I saw it’

(31) Den   såg  jag
     it-Ø  saw  I-NOM
     OBJ        SUBJ
     ‘It, I saw’

(32) Den   såg  mig
     it-Ø  saw  me-ACC
     SUBJ       OBJ
     ‘It saw me’

37 In the following we will adhere to a rather liberal definition of pronouns, following Teleman, Hellberg and Andersson 1999, which includes:
   1. definite pronouns - personal, e.g. han ‘he’, honom ‘him’, demonstrative, e.g. denna ‘this’, reflexive, e.g. sig ‘him/her/itself’, reciprocal pronouns, e.g. varandra ‘each other’
   2. interrogative pronouns, e.g. vem ‘who’, vilken ‘which’
   3. quantifying pronouns, e.g. alla ‘all’, någon ‘some’
   4. relational pronouns - comparative, e.g. samma ‘same’, ordinal, e.g. första ‘first’

Case marking is not, however, always unambiguously indicative of syntactic function. For instance, in both Swedish and Norwegian, the third person singular pronouns det, den ‘it’ have the same form for nominative and accusative case. Quantifying pronouns, like alla ‘all’, många ‘many’, are also invariant for case. In the examples in (33)–(34), in contrast to (30)–(32) above, case does not indicate syntactic functions for the pronominal argument den ‘it’ and the proper noun Lisa.

(33) Lisa    såg  den
     Lisa-Ø  saw  it-Ø

(34) Den   såg  Lisa
     it-Ø  saw  Lisa-Ø

In Norwegian, nominative form is preferred when a pronoun is stressed, regardless of syntactic function (Johannessen 1998). So, when followed by for instance a relative clause, the pronoun will be in its nominative form even when functioning as an object.38 In example (35) we see a nominative pronoun functioning as object, whereas the same pronoun functions as subject in (36):39

(35) Dette  gjelder   i   tillegg   de        som  håndterer  . . .
     this   concerns  in  addition  they-NOM  who  handle     . . .
     ‘This also concerns those who handle . . . ’

(36) De        som  fortsatt  tror   at    idyllen    kan  bevares        [. . . ]  tar   alvorlig   feil
     they-NOM  who  still     think  that  idyll-DEF  can  maintain-PASS  [. . . ]  take  seriously  wrong
     ‘Those who still believe that the idyll can be maintained [. . . ] are seriously mistaken’

The same tendency for the plural 3rd person with relative clause modification de som . . . ‘they who’ has been noted in written Swedish as well (Teleman, Hellberg and Andersson 1999: vol. 2, 299). In Swedish spoken language and casual writing, the 3rd person plural pronoun is realized as dom ‘they/them’ in both subject and object function.
38Note that the nominative form is employed also when the pronoun is modified by a preposi- tional postnominal modifier, so the preference for nominative case is not due to the argument’s subject status in the relative clause. 39The examples in (35)–(36) are taken from the Norwegian Oslo Corpus, see section 5.1.3. 52 Properties of Scandinavian morphosyntax 4.1.1.2 Arguments and determiners Genitive case signals a nominal’s status as determiner. Definite and personal pronouns distinguish genitive case formally, e.g. min, hans, deras ‘my, his, their’, whereas proper and common nouns do not distinguish nominative and accusative case, but may be marked for genitive case with the suffix -s, e.g., Gabriels ‘Gabriel’s’, doktorandens ‘the PhD-student’s’.40 We may furthermore distinguish between nominal and attributive pronouns, where nominal pronouns are pronouns which function as independent argu- ments, and the attributive pronouns are determiners. However, this distinction is blurred by the fact that many pronouns may function as both, and are then formally identical. This is true for most of the pronouns which do not distin- guish case, see section 4.1.1.1 above. For instance, the pronoun den ‘it’ may function as a subject, as in (37), and as a definite determiner, as in (38), where it modifies a common noun:41 (37) Sedan later somnar sleeps den it ‘Later, it falls asleep’ (38) Den the vetenskap science som which sysslar deals med with dessa these kallas call-PASS psykiatri psychiatry ‘The science which deals with these matters is called psychiatry’ The neuter form det ‘it’ exhibits the same functional variation and in addition may also function as expletive subject and object.42 4.1.2 Definiteness In section 3.3 we examined semantic definiteness and discussed criteria for definiteness, including identifiability and more discourse-pragmatic notions re- lated to the cognitive status of an element. 
The Scandinavian languages mark 40There are alternative genitive constructions which are remniscient of the genitive alterna- tion found e.g. in English. For instance, in Norwegian jentas bror ‘the girl’s brother’ and broren til jenta ‘the brother of the girl’. The alternation is less common in standard Swedish which typ- ically uses the genitive suffix in these cases. In both languages, however, there is the possibility of expressing part-whole relations in terms of prepositional modification with the preposition på ‘on’: taket på huset ‘the roof on the house’. We will have more to say about differential properties of genitive constructions in chapter 6 and 9. 41The examples in (37)–(38) are taken from the Swedish treebank Talbanken05, see section 5.1.1. 42See section 8.3.1 and examples therein for a corpus study of the different argument relations in Swedish, including formal subjects. 4.2 Word order 53 definiteness morphologically, but formal definiteness is not completely iso- morphic with semantic definiteness. Nouns are marked for definiteness by a definite suffix, e.g. bil-en ‘car-DEF’, hus-et ‘house-DEF’.43 There is agreement for definiteness within the noun phrase, governed by the nominal head: (39) det the gamla old-DEF året year-DEF ‘the old year’ The definite suffix is not, however, necessary, nor sufficient for semantic defi- niteness. Noun phrases may be semantically definite without definite marking on the noun when rendered definite by properties of the construction, e.g., by a definite determiner. For instance, genitive determiners, as in (40), some def- inite determiners, as in (41), as well as the universal quantifier may combine with an indefinite noun to form a semantically definite noun phrase:44 (40) Gabriels Gabriel-GEN bil car ‘Gabriel’s car’ (41) Den the bil car som which Gabriel Gabriel äger owns ‘the car that Gabriel owns’ There are also nouns with definite marking which are not semantically definite. 
In particular, definite nouns may be employed with generic reference, to refer to a type rather than to a particular instance:

(42) Lejonet   är  Afrikas     största  köttätare
     lion-DEF  is  Africa-GEN  largest  carnivore
     ‘The lion is Africa’s largest carnivore’

43 The particular definite suffix is determined by the gender of the noun.
44 Scandinavian noun phrases and definiteness is an intriguing subject which we will not aim to cover in the current context. For instance, nouns with definite marking may occur as a bare noun phrase, e.g., bilen ‘car-DEF’, but may also be specified by a definite determiner, exhibiting so-called “double definiteness”, e.g. den bilen ‘that car-DEF’. See Börjars 1998 for an in-depth study and analysis of Scandinavian noun phrases.

4.2 Word order

The classical descriptive model for Scandinavian word order is based around organization into so-called topological fields (Diderichsen 1957). The topological fields approach separates the clause into, roughly speaking, three parts: the initial field, the mid field and the end field (Teleman, Hellberg and Andersson 1999):

(43)          Initial    Mid            End
     MAIN     I morgon   kan hon inte   vara med vid sammanträdet
              tomorrow   can she not    be with at meeting-DEF
     SUBORD   eftersom   hon inte kan   vara med vid sammanträdet
              since      she not can    be with at meeting-DEF

Note that the topological fields are not constituents in the phrase structural sense, and do not pass standard constituency tests such as topicalization.
For syntactic theories which propose a separation between functional structure and linearization,45 however, topological fields provide a natural extension for ex- pressing linearization in Germanic languages.46 The separation into topologi- cal fields enables generalization over the word order patterns in the Scandina- vian languages, capturing some key properties regarding the positioning of the verb, as well as positioning of arguments and adverbials across various clause types. Some relevant properties of Scandinavian syntactic structure captured in the fields approach are summarized below. 4.2.1 Initial variation The initial position is characterized by a great deal of variation. It has been claimed to mark the syntactic-semantic type of the clause and is closely re- lated to the speech act expressed by the clause (Platzack 1987). Moreover, the initial constituent is often topical, in the sense that it links the sentence to the preceding context.47 Most clausal constituents may occupy initial position in declarative main clauses, e.g., subjects (44), direct objects (45) and adverbials (46). Constituent questions contain a wh-word in initial position, as in (47). (44) Statsministern primeminister-DEF håller holds talet speech-DEF i in morgon tomorrow ‘The primeminister gives the speech tomorrow’ 45This is true of LFG (Bresnan 2001), most flavours of dependency grammar (Sgall, Hajico˘vá and Panevová 1986; Mel’c˘uk 1988; Hudson 1990), as well as some versions of HPSG, e.g., Pollard and Sag 1994. 46See Ahrenberg (1990) for an early formalization employing regular expressions over a constituent-based analysis constituting a separate level (t-structure) within an LFG grammar, and Bröker 1998 for an implementation of a dependency grammar with ordering by topological fields introduced as so-called metacategories. 
47The realization of an argument in initial position is referred to as topicalization and is thought of as movement to clause-initial position in transformational theories. 4.2 Word order 55 (45) Talet speech-DEF håller holds statsministern primeminister-DEF i in morgon tomorrow ‘The speech, the primeminister gives tomorrow’ (46) I in morgon tomorrow håller holds statsministern primeminister-DEF talet speech-DEF ‘Tomorrow, the primeminister gives the speech’ (47) När when håller holds statsministern primeminister-DEF talet? speech-DEF ‘When does the primeminister give the speech?’ The initial position may also be empty. Imperative clauses and yes/no-questions are verb-initial in Scandinavian, cf. (48)–(49). (48) Håll hold talet speech-DEF i in morgon! tomorrow ‘Give the speech tomorrow!’ (49) Håller holds statsministern primeminister-DEF talet speech-DEF i in morgon? tomorrow ‘Does the prime minister give the speech tomorrow?’ 4.2.2 Rigid verb placement Like the majority of Germanic languages, but unlike English, the Scandinavian languages are verb second (V2); the finite verb is the second constituent in declarative main clauses, see (44)–(47) above. Subordinate clauses, however, are not V2: (50) . . . eftersom since statsministern primeminister nog probably inte not håller holds talet speech-DEF i in morgon tomorrow ‘. . . since the prime minister probably will not give the speech tomorrow’ Non-finite verbs follow the finite verb, but precede their complements.48 Nei- ther of the elements in the end field, i.e., the non-finite verb, followed by vari- ous objects and adverbials, are obligatory. In fact, the only obligatory element 48In this respect Scandinavian differs from German, which positions non-finite verbs in clause final position. 56 Properties of Scandinavian morphosyntax in the clause is the finite verb, and, with a few exceptions, the subject. 
However, the presence of a non-finite verb introduces a greater rigidity in terms of positioning and interpretation of the clausal constituents.49 With respect to arguments, only subjects may intervene between a finite and non-finite verb, as in (52), and, as mentioned already, only objects may follow the non-finite verb, as in (51):

(51) Statsministern     ska    hålla  talet
     primeminister-DEF  shall  hold   speech-DEF
     ‘The prime minister will give the speech’

(52) Talet       ska    statsministern  hålla
     speech-DEF  shall  primeminister   hold
     ‘The speech, the prime minister will give’

Main clauses consisting of a finite, transitive verb along with its arguments are structurally ambiguous, as in (53), whereas the placement of a non-finite verb in the same clause clearly indicates syntactic functions, as in (54)–(55):

(53) Vem  såg  Ida?
     who  saw  Ida
     ‘Who saw Ida / Who did Ida see?’

(54) Vem   har  sett  Ida?
     who   has  seen  Ida
     SUBJ             OBJ
     ‘Who has seen Ida?’

(55) Vem  har  Ida   sett?
     who  has  Ida   seen
     OBJ       SUBJ
     ‘Who has Ida seen?’

These rigid placement constraints extend also to particles and prepositional modifiers of the finite verb:

(56) Vem   kom   ihåg       Ida?
     who   came  in-memory  Ida
     SUBJ                   OBJ
     ‘Who remembered Ida?’

(57) Vem  kom   Ida   ihåg?
     who  came  Ida   in-memory
     OBJ        SUBJ
     ‘Who did Ida remember?’

49 This fact has been taken as an indication that Swedish exhibits evidence for a VP only in clauses with a non-finite lexical verb (Andréasson 2007). Such an analysis clearly questions the status of Scandinavian as a strictly configurational language.

4.2.3 Variable argument placement

The generalization that most constituents may occupy sentence-initial position entails that they have two alternative positions – initial position and a non-initial position.
A schematized version of the predictions of the fields analysis with respect to the linearization of verbs and (non-initial) arguments in main clauses is provided in (58) below (Engdahl, Andréasson and Börjars 2004):50

(58) Linearization of grammatical functions in declarative main clauses:
     XP  V_fin  SUBJ  S-ADV  V_non-fin  OBJ_ind  OBJ_dir  ADV

The subject, for instance, may occupy either the initial position or the position immediately following the verb. Note that the fields analysis does not capture the generalization that the subject is the most common initial constituent. The basic word order of a language is “typically identified with the order that occurs in stylistically neutral, independent, indicative clauses [. . .], it is the ordering of constituents in prototypical transitive clauses” (Siewierska 1988: 8). In this respect, the Scandinavian languages must be said to be SVO languages.

Subordinate clauses differ from the schema in (58) in that they have a different ordering of the arguments with respect to the finite verb in the mid field:

(59) Linearization of grammatical functions in subordinate clauses:
     subj  SUBJ  S-ADV  V_fin  V_non-fin  OBJ_ind  OBJ_dir  ADV

A uniform analysis of main and subordinate clauses has been proposed under the assumption that the subjunction and the finite verb are instances of the same category (C⁰), which expresses the finiteness of the clause (Platzack 1987).

50 Note the similarity with the hierarchy of grammatical functions presented in section 3.1. There we made a distinction between primary and secondary objects, following Bresnan 2001. The more traditional terms ‘direct’ and ‘indirect’ object will however be employed in the following. Recall that primary objects denote indirect objects and objects of monotransitive verbs, and secondary objects denote the direct objects of ditransitive verbs. On this mapping, the ordering in the schema in (58) corresponds directly to the one proposed in the hierarchy.
Note, however, that the original hierarchy in Keenan and Comrie 1977 proposes the reverse ordering (OBJ

     "" "brev" noun common sing def neuter @obj @subj
     "" "med" preposition @adv
     "" "det" determiner demonstrative sing neuter @det>
     "" "pussig" adjective sing def @adj>
     "" "innhold" noun common sing def neuter @
     " "skrive" verb past tr1 i1 tr11 pa1 d1 pa5 pa3 @fv
     "" "jente" noun common sing def fem @obj @subj

As we can see, only the noun of a noun phrase receives the syntactic function tags @subj and/or @obj, whereas other modifying elements in the phrase will receive modifier-tags, relating them to the noun. Unlike the dependency analysis of Talbanken05, see section 5.1.1, the dependency relation is underspecified with respect to which element is head of a syntactic label or tag. It does not, for instance, make explicit that the subject and object are dependents of the main (and only) verb. Notice also that the subject and object have not been disambiguated; both readings are still present in the output. The above example illustrates a containment of ambiguity which follows directly from the eliminative approach of Constraint Grammar. Since the rules have not been able to remove all but one analysis, both remain in the output.

61 The Oslo Corpus is available for research purposes, see http://www.hf.uio.no/tekstlab

5.2 Machine Learning

This section presents the machine learning algorithms and software which are employed in Part II of this thesis dealing with lexical acquisition of animacy information for common nouns in Scandinavian.
Machine learning may be defined as follows (Mitchell 1997: 2):

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Properties of the training experience give rise to the distinction between supervised and unsupervised learning, where the former is characterized by direct evidence and the latter by indirect evidence. In supervised learning, the training experience consists of input-output pairs, whereas unsupervised learning involves learning without output values. The input to learning is commonly represented as a feature vector, a tuple of features with corresponding values $\langle f_1 = v_1, \ldots, f_n = v_n \rangle$, which defines an n-dimensional vector or feature space.

We will primarily be concerned with supervised learning in the following and present two supervised machine learning systems, C5.0 presented in section 5.2.1, and TiMBL presented in section 5.2.2, based on decision-tree learning and memory-based learning, respectively. However, we will also employ the unsupervised technique of clustering as a method for data exploration in section 6.7.1 and we briefly present the clustering software Cluto in section 5.2.3.

Machine learning is based on inductive reasoning, hence improvement of the performance measure is usually defined through generalization to unseen instances. Most machine learning tasks may furthermore be reduced to learning a target function and therefore rely on an algorithm for locating the function that best fits the training data. The way in which the search for the best hypothesis is performed is part of the inductive bias of the machine learning algorithm. We experiment with two quite different machine learning algorithms instantiating the general distinction between eager and lazy learning algorithms.
Eager learning algorithms generalize over the data prior to the application to unseen instances, whereas lazy algorithms postpone generalization until the application to a new instance. The main difference between the two is thus found in the fact that lazy algorithms may consider the unseen instance when deciding how to generalize, whereas the eager algorithm may not (Mitchell 1997: 244f). The C4.5 algorithm employed for decision-tree learning is an eager algorithm, whereas the k-nearest neighbor algorithm employed in memory-based learning is a lazy learning algorithm.

5.2.1 Decision trees (C5.0)

A decision tree is a classification model which relates a set of predefined classes with properties of the instances to be classified. Classification using a decision tree proceeds by means of a set of weighted, disjunctive tests: at each step, or node, in the decision tree an appropriate test is applied to the input, and classification proceeds along one of the node's branches, which represent the possible outcomes of the test. The software package employed for decision tree learning is C5.0 (Quinlan 1993).62

Decision trees may be learned inductively by examining a set of training data and, based on properties of these, constructing a classification tree. An initial tree is constructed from the training data by means of a splitting criterion and a stopping criterion (Manning and Schütze 1999). The splitting criterion grows the tree by dividing the training data into increasingly smaller subsets, whereas the stopping criterion tells the learner when to stop splitting. Following the C4.5 algorithm for decision-tree learning (Quinlan 1993), the splitting of a training set T into subsets $T_1, \ldots, T_n$ in accordance with a test X with n outcomes is determined by a measure of information gain, i.e., the information gained by applying the test X to the training data T.
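The splitting criterion can be illustrated with a small Python sketch. It assumes the usual entropy-based formulation, in which the gain of a test is the class entropy of the training set minus the weighted class entropy of the subsets the test produces. The function names (`info`, `info_x`, `gain`) and the toy animacy data are assumptions made for this illustration only; the sketch is not taken from the C5.0 implementation.

```python
import math
from collections import Counter, defaultdict


def info(labels):
    """Average information (entropy in bits) needed to identify the class of a case."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_x(cases, test):
    """Expected information still needed after partitioning the cases by `test`."""
    subsets = defaultdict(list)
    for features, label in cases:
        subsets[test(features)].append(label)
    total = len(cases)
    return sum(len(subset) / total * info(subset) for subset in subsets.values())


def gain(cases, test):
    """Information gain of `test`: info(T) minus info_X(T)."""
    labels = [label for _, label in cases]
    return info(labels) - info_x(cases, test)


# Hypothetical toy training set of (features, class) pairs.
cases = [
    ({"subj": "high", "def": "yes"}, "animate"),
    ({"subj": "high", "def": "no"}, "animate"),
    ({"subj": "low", "def": "yes"}, "inanimate"),
    ({"subj": "low", "def": "no"}, "inanimate"),
]

gain_subj = gain(cases, lambda f: f["subj"])  # this test separates the classes perfectly
gain_def = gain(cases, lambda f: f["def"])    # this test carries no class information
```

In this toy set, a learner maximising gain would split on the hypothetical `subj` feature first, since it removes all uncertainty about the class, whereas `def` removes none.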
The information gain of a particular test X is the difference between the average amount of information needed to identify the class of a case in T and the amount of information still needed once the data have been partitioned in accordance with X:

\[ \mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T) \]

The first term, $\mathrm{info}(T)$, is obtained by summing over the information resulting from choosing each class $C_1, \ldots, C_k$, weighted by the frequency of the class in the training set T:63

\[ \mathrm{info}(T) = - \sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, T)}{|T|} \times \log_2 \left( \frac{\mathrm{freq}(C_j, T)}{|T|} \right) \]

The information measure $\mathrm{info}_X(T)$ for a certain test X which partitions the training data into n subsets is obtained by summing over the information contained within each subset, weighted by the frequency of the subset cases in the training set as a whole:

\[ \mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i) \]

At each node in the decision tree, the test is chosen which maximises the information gain.64 The splitting process is terminated when (i) all the subsets contain cases of the same class, or (ii) no further tests improve the results.

The decision tree resulting from the initial splitting phase usually has the disadvantage of overfitting the data, i.e. it places too much significance on the observations made in the training data and may induce generalizations from mere coincidental properties of these. As a second stage in constructing a decision tree, a stage of pruning is vital to performance on unseen test cases. In the C5.0 system, the pruning of a decision tree is based on the predicted error rate of all the subtrees (Quinlan 1993).65

62 The C5.0 software package may be downloaded from http://www.rulequest.com/.
63 On a more general note, info(X) = H(X), the entropy for a single random variable.

5.2.2 Memory-Based Learning (TiMBL)

Memory-Based Learning (Daelemans 1999; Daelemans and van den Bosch 2005) is a machine learning approach which is characterized by a notion of analogy, rather than abstraction.
Training instances which constitute the training experience are simply stored in memory. At classification, some definition of similarity is employed in order to locate the instance(s) most similar to the new, unseen instance to be classified. Classification of the new instance is based directly on the classes previously assigned to these similar examples and learning is therefore supervised. Memory-based learning furthermore employs a lazy learning algorithm, the k-nearest neighbor algorithm, which postpones learning until classification time.

In our experiments we make use of the Tilburg Memory-Based Learner (TiMBL) (Daelemans et al. 2004).66 TiMBL has a range of parameters which may be set to affect the learning process in various ways. The most important of these parameters relate to either the selection of the nearest neighbors, i.e. which instances in memory should be allowed to affect the classification of a new instance, or the influence which each of these neighbors may exert over the final classification.

64 In fact, the measure of information gain has the disadvantage of favoring tests with many outcomes, with a worst-case scenario of one case per leaf node. Therefore the C5.0 system employs a refined measure of information gain, the gain ratio (Quinlan 1993: 23).
65 The predicted error rate is calculated directly from the training data and thus requires no held-out data for pruning. This is a clear advantage when data are sparse.
66 TiMBL is freely available at http://ilk.uvt.nl/software.html

With regard to locating the nearest neighbors, various similarity metrics may be employed, which calculate the distance in vector space between the instance to be classified and various candidate neighbors. The default distance metric is the Overlap metric, where the distance between two instances is calculated as the number of features whose values differ.
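A minimal Python sketch of memory-based classification with the Overlap metric is given below. It is an illustration under stated assumptions, not TiMBL's implementation: instances are assumed to be tuples of symbolic feature values, all features and all neighbors are given equal weight (plain majority voting), and k is read as the number of nearest distinct distances rather than the number of individual neighbors. The names `overlap_distance` and `classify`, as well as the toy instances, are hypothetical.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions on which two instances have different values."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(memory, instance, k=1):
    """Majority vote among the stored instances found at the k nearest distances."""
    distances = sorted({overlap_distance(features, instance) for features, _ in memory})
    nearest = set(distances[:k])
    votes = Counter(
        label
        for features, label in memory
        if overlap_distance(features, instance) in nearest
    )
    return votes.most_common(1)[0][0]


# Hypothetical training instances, simply stored in memory: (features, class).
memory = [
    (("high", "yes", "agent"), "animate"),
    (("high", "no", "agent"), "animate"),
    (("low", "yes", "theme"), "inanimate"),
    (("low", "no", "theme"), "inanimate"),
]

predicted = classify(memory, ("high", "yes", "patient"), k=1)
```

Nothing is computed at training time beyond storing the instances; all work is postponed until `classify` is called, which is what makes the approach lazy.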
The k option allows the user to specify the region within which the k-nearest neighbors are found, where k is the number of distances considered (Daelemans et al. 2004). The parameter settings for feature weighting provide methods for assigning differing importance to individual features. One may also choose to give all features equal weight. The Information Gain setting takes the informativity of each feature into account when assigning weights, and the Gain Ratio (default) setting in addition normalizes for the number of values each feature may take on, based on the training data.

Finally in classification, the k-nearest neighbors determine the class of the unseen instance. The manner in which this decision is made can be influenced by the class voting weights. Either all neighbors have equal weight and the majority determines the class, so-called “majority voting”, or the votes of closer neighbors are given more importance than more distant ones.

5.2.3 Clustering (Cluto)

Clustering is one of the primary methods of unsupervised machine learning where elements are grouped together based on their level of similarity. It is an unsupervised method since there is no use of manually annotated training data. Similarity is usually defined by distance in a high-dimensional vector space.

The clustering experiments presented in section 6.7 are performed using the Cluto clustering software (Karypis 2002), which is freely available.67 Clustering algorithms are commonly classified as either bottom-up, so-called agglomerative clustering, or top-down, also known as partitive or divisive clustering. Cluto supports a range of different clustering algorithms of both types, as well as a range of parameters specifying a criterion function. The criterion function is defined over the instances of the clusters and provides a value expressing the level of similarity within the set of clusters, between the individual clusters or a combination of these.
The main goal of clustering can then be reduced to locating the cluster solution which optimizes a certain criterion function.

A key problem in automatic clustering resides in locating the optimal number of clusters given a data set and no further information regarding the number of assumed categories. In Cluto this problem is bypassed by requiring the user to define the desired number of clusters in a clustering solution with the k-parameter. Section 6.7 provides more detail on the specific algorithm, criterion function and k-values employed, as well as discussing evaluation of the obtained cluster solutions.

67 The Cluto software may be obtained from http://glaros.dtc.umn.edu/gkhome/views/cluto

5.3 Parsing

In Part III we present experiments in data-driven dependency parsing. We primarily employ the MaltParser system; however, a contrastive study is also performed with the MSTParser system. An introduction to data-driven dependency parsing is provided in section 8.1, as well as a more detailed introduction to the MaltParser system.

5.3.1 MaltParser

The freely available MaltParser68 is a language-independent system for data-driven dependency parsing. MaltParser is based on a deterministic parsing strategy, first proposed by Nivre (2003) and extended to labeled dependency graphs in Nivre, Hall and Nilsson 2004, in combination with treebank-induced classifiers for predicting the next parsing action. See section 8.1.3 for more detail.

5.3.2 MSTParser

MSTParser is freely available69 and is a language-independent system for data-driven dependency parsing. It searches for the maximum spanning tree over directed graphs, and employs large-margin discriminative training for the induction of scoring functions (McDonald, Crammer and Pereira 2005; McDonald et al. 2005). See section 9.3.1.2 for more detail and a comparison with MaltParser.
68 http://w3.msi.vxu.se/users/nivre/research/MaltParser.html
69 MSTParser is freely available from http://mstparser.sourceforge.net

Part II Lexical Acquisition

6 ACQUIRING ANIMACY – EXPERIMENTAL EXPLORATION

Lexical acquisition deals with the automatic induction of lexical information. This area of computational linguistics has gained an increasingly important role as large linguistic corpora have become more easily available. The methods employed are by definition data-driven and rely on statistical inference over language data in some form. Whereas lexical information can refer to a wide variety of properties, most recent work has focused on the acquisition of lexical semantics. The basic approaches can be summarized as follows (Baldwin 2006):

Lexical similarity: identify “near-matches” (synonyms, near-synonyms, associated word etc.) to the given lexical item, and “inherit” their semantic properties

Lexico-syntactic patterns: identify the lexico-syntactic patterns associated with a given phenomenon, and look for corpus occurrences thereof

Resource mining: mine pre-existing lexical resource(s) for relevant information

The view of lexical semantics which underlies the first two approaches to lexical acquisition is what is often referred to as the distributional hypothesis, stating that words which have similar distributions in language will also have similar meanings. The context of usage is thus taken to define the meaning of lexical items, and the context is usually represented as a high-dimensional vector. The way in which the context of usage (or the dimensions of the vector) is defined varies, ranging from simple word forms (Schütze 1998; Sahlgren 2006) through lemmas and parts-of-speech to syntactic relations (Lin 1998). The second approach, which relies on lexico-syntactic patterns, in addition assumes that the syntactic distribution of lexical items constitutes a reliable predictor of semantics or meaning.
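The idea of representing the context of usage as a vector can be sketched in a few lines of Python. The sketch below assumes the simplest of the context definitions mentioned above, namely word forms within a fixed window, and compares the resulting count vectors by cosine similarity; the window size, the function names and the toy English sentences are all assumptions made for illustration, not the setup used in the experiments of this chapter.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the word forms occurring within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical miniature corpus of tokenized sentences.
sentences = [
    ["the", "girl", "wrote", "the", "letter"],
    ["the", "boy", "wrote", "the", "letter"],
    ["the", "girl", "read", "the", "book"],
    ["the", "boy", "read", "the", "book"],
]

girl = context_vector("girl", sentences)
boy = context_vector("boy", sentences)
letter = context_vector("letter", sentences)
```

In this toy corpus the two animate nouns occur in identical contexts and therefore receive a higher similarity to each other than either does to the inanimate noun, which is the intuition behind the distributional hypothesis.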
There are, generally speaking, two main methods for the evaluation of the acquired information in lexical acquisition – internal and external evaluation. In an internal evaluation the acquired information is evaluated against a gold standard of some kind, e.g. a manually annotated corpus or a lexical resource (thesaurus, ontology etc.). An external evaluation evaluates the effect of the added information in terms of performance on a separate NLP task, such as parsing or semantic role labeling.

In chapter 3, we discussed the lexical semantic property of animacy and its relation to the realization and interpretation of syntactic arguments. We saw that animacy is a central factor in argument differentiation since arguments differ in their degree of correlation with the dimension of animacy, leading to observable frequency effects in a range of languages. Explanations for these correlations appeal to notions of accessibility, cognitive status and prominence, see section 3.6. We also distinguished animacy as a denotational property from a referential property, claiming that animacy is largely a denotational property of nouns.

This chapter aims at investigating the theoretical claims from chapter 3 further.70 In particular, we argue that the task of animacy classification, a subtask of the general problem of lexical acquisition of animacy information, provides us with a methodology for evaluating the theoretical proposals. We employ data-driven methods which highlight the correlation between syntax and semantics and enable us to quantify the strength of this association. In particular, we investigate the extent to which the syntactic distribution of a noun is indicative of its animacy, and to what extent it is possible to generalize from syntactic behaviour to semantic animacy for unseen nouns.
The assumption that animacy is a denotational property of nouns entails that the animacy of a noun is fairly stable across different contexts. We may consequently test whether animacy can be obtained as a class property at the level of lemmas or types, i.e. whether animacy may be acquired based on information regarding lemmas. Furthermore, the dimension of animacy and its delimitation and gradience may be tested by examining different types of nouns. To be more precise, we will investigate the following issues:

Viability: Can animacy be acquired through morphosyntactic distributional features for noun lemmas?

Features: Can we approximate the linguistic generalizations regarding animacy as empirical features taken from a corpus? Which features may be employed to acquire animacy? Which of these features are most important?

Generalizability – robustness: How is the classification affected by sparse data? How much data is needed to acquire stable generalizations?

Generalizability – machine learning algorithm: Does the choice of machine learning algorithm affect the results? Are the results generalizable across machine learning algorithms with different properties? We examine the following distinctions, see section 5.2 for more detail:
  1. eager versus lazy learning algorithms
  2. supervised versus unsupervised learning

Class granularity: Do the results say anything about the delimitation of the animacy dimension and its division into classes? Can we test the hypotheses regarding gradience in a principled manner?

70 Parts of this chapter build on work presented earlier in Øvrelid 2005, 2006.

6.1 Previous work

6.1.1 Animacy

Lexical acquisition of animacy information constitutes a relatively unexplored field of study in computational linguistics. It bears some resemblance to the task of named entity recognition (NER) (Tjong Kim Sang 2002b), which usually makes reference to a ‘person’ class (see, e.g., Chinchor et al. 1999).
However, whereas most NER systems make extensive use of orthographical, morphological or contextual clues (titles, suffixes) and gazetteers, animacy for nouns is not usually signaled overtly in the same way.

The work which has been done on acquisition of animacy information falls into the third category of lexical acquisition tasks mentioned above, namely that of ‘resource mining’. The lexical resource employed is that of the English WordNet (Fellbaum 1998). WordNet can be seen to represent the animacy distinction by having very general hypernyms (so-called ‘unique beginners’), such as ‘person’ and ‘artifact’, and the idea is that the hyponyms of these general concepts inherit their animacy. However, direct extraction from WordNet is not completely trivial; the animacy of the unique beginners does not unequivocally distribute to their hyponyms and WordNet, as is well-known, contains extensive polysemy, i.e. a word may have several senses.

Orăsan and Evans (2001, 2007) present a study where animacy information is inherited from hypernyms (starting with unique beginners) in WordNet and where polysemy is distributed evenly across the various senses. A threshold for animacy is then determined empirically. An extended approach in addition makes use of a few contextual clues in the annotation of a corpus for animacy.[71] Orăsan and Evans (2007) show that the acquired animacy information can be beneficial for anaphora resolution, hence evaluate the classification externally.

It is clear from the above that the methods employed in previous work on acquisition of animacy information are restricted to languages for which large scale lexical resources expressing this distinction, such as WordNet, are available.

[71] Orăsan and Evans (2007) employ machine learning in order to annotate the animacy of a given noun in a corpus. In addition to the dominant animacy of the noun, taken from WordNet,
they make use of the “animacy of the verb” if the noun in question is subject and the proportion of animate/inanimate pronouns in the text. It is unclear, however, how the former feature is calculated – whether the information is taken from the WordNet resource in some way or from the gold standard corpus. The latter feature provides a measure of the frequency of animate entities as a whole for the text in question.

6.1.2 Verb frames and classes

Whereas acquisition of information regarding nouns in general has not been given much attention in work on lexical acquisition, various syntactic and semantic properties of verbs have been widely studied.[72] In particular, the induction of various types of verb frames and verb classes has been the subject of a range of studies.

The subcategorization frame of a verb describes the types of arguments a verb takes and is usually assumed to be a gradient property in computational work, hence describes the differing propensities of verbs for different syntactic frames or sets of arguments. Lexical acquisition of subcategorization information (Manning 2003; Carroll and Rooth 1998) usually relies on a parsed corpus, and approaches differ with respect to the number of frames and the information (parts-of-speech, lexicalization) provided for the frames (Schulte im Walde 2007).

Acquisition of selectional restrictions (Resnik 1996; Erk 2007), see sections 3.5 and 9.2.8, can be seen as an extension of subcategorization frame induction, where the syntactic classes of arguments are specified for semantic class.[73] This information is usually taken from a lexical resource, typically WordNet; however, lexical similarity has also been employed in order to generalize frames to unknown instances (Erk 2007).

Finally, induction of verbal, semantic classes has been extensively studied from the perspective of lexical acquisition. The availability of a lexicon of verb classes (Levin 1993) has enabled a common platform for evaluation. It is
an underlying assumption in the work on acquisition of verb classes that the syntactic distribution of a verb, in particular with respect to so-called alternations, is largely determined by its semantic class.[74] This assumption makes it possible to acquire semantic classes largely based on syntactic, distributional features.

[72] See Schulte im Walde 2007 for a comprehensive overview of work on acquisition of verbal frames and classes.
[73] Selectional restrictions as an extension of subcategorization is an oversimplification, due to the fact that these provide information from different linguistic levels. The latter is strictly syntactic, whereas the former is semantic. It is therefore in principle possible to separate the two distinctions, i.e. the selectional restrictions need not necessarily entail syntactic subcategorization. More on this in section 9.2.8 below.

Merlo and Stevenson (2001) study the acquisition of verb class information as a classification problem, focusing on three classes of optionally intransitive verbs in English – unergative verbs, e.g., race, unaccusative verbs, e.g., melt, and object-drop verbs, e.g., play. They make use of a small set of linguistically motivated features in a 3-way classification task and show a considerable improvement over a random baseline (69.8% accuracy with a baseline of 33.9%). Joanis and Stevenson (2003) extend this work to a larger feature set with comparable performance. The verb classification in Levin 1993, however, only provides one out of many possible ways of classifying verbs and it is also only available for English. Unsupervised approaches to acquisition of verb classes partially address this problem by clustering verbs without the necessity of a resource like Levin for English (Stevenson and Joanis 2003) and for languages where such resources are not available, e.g.
German (Schulte im Walde 2006).[75]

The close relation between syntax and semantics which is highlighted in particular in the work on verb classes bears a strong resemblance to our problem of animacy classification, as defined above, and we will use it as inspiration in the following.

[74] A syntactic alternation is a variation in terms of the syntactic realization of arguments with respect to a particular verb. Examples are the dative alternation, as in examples (10)–(11) in section 3.2.1, and spray/load alternations, e.g. spray the wall with paint vs spray paint onto the wall. See Levin 1993 for an extensive overview.
[75] Note, however, that a gold standard is still assumed for evaluation purposes, although it is not employed for training.

6.2 Data preliminaries

In the following we formulate the task of acquiring animacy information as a classification problem where the learning task consists of classifying nouns as being either animate or inanimate. The goal of learning is thus to find the approximated target function, V, which best performs this task. That is, given a set of noun lemmas, we want to locate the function which gives as output the greatest number of correctly assigned classification values. This is a discrete-valued function from noun instances to a class from a predefined set of classes, V : NounLemma → c ∈ {Anim, Inan}.

In order to train a classifier to distinguish between animate and inanimate nouns, we have to decide on the appropriate training experience for this task. In particular, we must select a set of representative instances and decide on a relevant representation of these instances. Section 6.2.1 discusses the choice of language and corpus resource for the present classification study, and 6.2.2 will detail the selection of nouns for classification.
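The definition of the target function can be illustrated with a minimal sketch: among a handful of candidate functions from noun lemmas to {Anim, Inan}, we pick the one that assigns the greatest number of correct classes. The gold classes follow table 6.1; the candidate functions and all names in the code (`always_inanimate`, `toy_lookup`, `best_hypothesis`) are hypothetical illustrations and not the classifiers used in the experiments.

```python
from typing import Callable, Dict, List

# Gold classes for four lemmas, following table 6.1.
GOLD: Dict[str, str] = {"mann": "Anim", "lærer": "Anim", "bil": "Inan", "bok": "Inan"}

Hypothesis = Callable[[str], str]


def always_inanimate(lemma: str) -> str:
    """Hypothetical candidate: every noun is inanimate."""
    return "Inan"


def always_animate(lemma: str) -> str:
    """Hypothetical candidate: every noun is animate."""
    return "Anim"


def toy_lookup(lemma: str) -> str:
    """Hypothetical candidate: an imperfect lookup that wrongly treats 'bil' as animate."""
    return "Anim" if lemma in {"mann", "lærer", "bil"} else "Inan"


def n_correct(v: Hypothesis, gold: Dict[str, str]) -> int:
    """Number of lemmas for which v outputs the gold class."""
    return sum(v(lemma) == cls for lemma, cls in gold.items())


def best_hypothesis(candidates: List[Hypothesis], gold: Dict[str, str]) -> Hypothesis:
    """Return the candidate with the greatest number of correct assignments."""
    return max(candidates, key=lambda v: n_correct(v, gold))


if __name__ == "__main__":
    best = best_hypothesis([always_inanimate, always_animate, toy_lookup], GOLD)
    print(best.__name__, n_correct(best, GOLD))
```

In practice the space of candidate functions is of course not enumerated by hand; it is searched by a learning algorithm over a feature representation of each lemma.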
In section 3.2 above we discussed the role of animacy in argument differentiation as a cross-linguistic tendency which exhibits clear frequency effects in language. We also discussed linguistic dimensions closely related to animacy and argument differentiation, such as agentivity, individuation and accessibility, see section 3.6.2. In section 6.2.3 we formulate a set of theoretically motivated features which exploit the close relation between animacy and distinctions in syntactic argumenthood, as well as related notions. It is important to note that these features only provide practical approximations of more theoretical notions of argumenthood, agentivity and accessibility. These theoretical notions are approximated by empirical features which may be extracted from an automatically annotated corpus.

6.2.1 Language and corpus resource

All experiments in this chapter are performed on Norwegian. In chapter 7, we will investigate the application of the methods developed in this chapter to another Scandinavian language, namely Swedish.

Since we wish to employ the morphosyntactic distribution of a noun as an indicator of its animacy, we need a corpus with morphological and syntactic annotation. Also, the corpus should be as large as possible, since we rely on inductive inference from frequencies in language use for classification. For the extraction of morphosyntactic distributional information, we choose to employ the Oslo Corpus, a corpus of Norwegian texts of approximately 18.5 million words. The corpus is morphosyntactically annotated and assigns an underspecified dependency-style analysis to each sentence, see section 5.1.3 for more detail.[76] The containment of ambiguity which is a property of Constraint Grammar can be seen to be an advantage in the approximation of features, since it enables the exclusion of instances which were deemed ambiguous by the grammar.
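The exclusion of ambiguous instances can be sketched as follows, assuming that a token which has not been fully disambiguated carries more than one syntactic function tag. The token representation, the tag names `SUBJ` and `OBJ`, and the toy data are hypothetical and do not reproduce the actual annotation format of the Oslo Corpus.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical toy tokens: (lemma, set of syntactic function tags left by the tagger).
Token = Tuple[str, Set[str]]

TOKENS: List[Token] = [
    ("forfatter", {"SUBJ"}),
    ("forfatter", {"SUBJ", "OBJ"}),  # not fully disambiguated: excluded
    ("forfatter", {"OBJ"}),
    ("artikkel", {"OBJ"}),
    ("artikkel", {"SUBJ", "OBJ"}),   # not fully disambiguated: excluded
]


def count_unambiguous(tokens: List[Token], function: str) -> Counter:
    """Count, per lemma, the tokens whose only remaining tag is `function`."""
    counts: Counter = Counter()
    for lemma, tags in tokens:
        if tags == {function}:
            counts[lemma] += 1
    return counts


if __name__ == "__main__":
    print(dict(count_unambiguous(TOKENS, "SUBJ")))
    print(dict(count_unambiguous(TOKENS, "OBJ")))
```

The design choice here is to trade recall for precision: ambiguous tokens are simply dropped rather than resolved, so the resulting counts are lower but less noisy.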
[76] All examples in the current chapter are taken from the Oslo Corpus, unless otherwise stated.

Animate: barn ‘child’, direktør ‘director’, far ‘father’, flyktning ‘refugee’, forfatter ‘author’, gutt ‘boy’, jente ‘girl’, kvinne ‘woman’, leder ‘leader’, lege ‘doctor’, lærer ‘teacher’, mann ‘man’, medlem ‘member’, mor ‘mother’, person ‘person’, president ‘president’, sjef ‘boss’, soldat ‘soldier’, trener ‘coach’, venn ‘friend’

Inanimate: aksje ‘stock’, artikkel ‘article’, bil ‘car’, bok ‘book’, brev ‘letter’, dag ‘day’, eiendom ‘property’, fly ‘airplane’, hus ‘house’, informasjon ‘information’, natt ‘night’, oppgave ‘task’, opplysning ‘(piece of) information’, penge ‘coin/money’, pris ‘price’, produkt ‘product’, spørsmål ‘question’, svar ‘answer’, ting ‘thing’, vare ‘merchandise’

Table 6.1: Highly frequent (> 1000) animate and inanimate nouns; Norwegian

6.2.2 Noun selection

As training data for the classifier, a set of 40 nouns was manually selected – 20 animate and 20 inanimate nouns, see table 6.1. The nouns were chosen based on two criteria: i) they are all Norwegian translations of nouns taken from the English WordNet, all of which are hyponyms of concepts distinguishing animacy, and ii) they are all highly frequent in the corpus.[77]

The animate nouns that were chosen were all hyponyms of the person-relation, which is itself a hyponym of animate thing/living thing. A corpus study of Norwegian simple transitives in a sample of the Oslo Corpus showed that nouns expressing reference to animals, i.e. animate beings aside from humans, are very infrequent in the corpus (Øvrelid 2004), and these types of nouns will therefore not be included in the following.[78] There is no single category in WordNet that expresses the property of inanimateness, so the inanimate nouns were taken from two main hypernyms which ensure a spread in terms of the abstractness of the noun, namely the concepts artifact, e.g.
bil ‘car’, bok ‘book’ and abstraction, e.g. pris ‘price’, informasjon ‘information’. The choice of highly frequent nouns was made in order to ensure a sufficient amount of data to test our various features on. The threshold for these nouns was set at > 1000 occurrences. In section 6.4 below, we examine the effect of alternative threshold assignments on animacy classification.

[77] There is no Norwegian WordNet resource, but there is a small semantic lexicon, SIMPLE, available which contains semantic information for approximately 10,000 nouns structured according to the notion of qualia-roles (Pustejovsky 1991). Unfortunately however, the choice of nouns in this lexicon is rather limited. For instance, there are entries for bildekk ‘car tyre’ and bilradio ‘car radio’, but no entry for bil ‘car’! Due to our initial frequency threshold, the information available in the SIMPLE lexicon was deemed too specialized.
[78] Øvrelid (2004) found that only 0.6% (5/889) of the common nouns in the sample refer to animals.

6.2.3 Features of animacy

The nouns listed in table 6.1 above are represented by a set of features which express properties of their morphosyntactic distribution. For each noun $w$, relative frequencies of various morphosyntactic features $f_i$ are calculated from the corpus:

\[ \frac{\mathrm{freq}(f_i, w)}{\mathrm{freq}(w)} \]

The features chosen to represent the nouns are presented below. Note that the features are all assumed to be noisy, as they are based on automatic annotation and simple regular expressions for extraction. We will test empirically in the experimental section whether these features capture the relevant distinctions with respect to animacy despite the noise caused by input data and feature approximation.

Subject and object

Subjects and objects tend to differ with respect to animacy and this tendency has been observed as frequency effects in a range of languages.
We have, in particular, discussed the effect of relative animacy in transitive constructions. The proportion of subject and object occurrences for each noun is therefore recorded. For transitive subjects (SUBJ), we extract the number of instances where the noun in question is unambiguously tagged as subject and followed by a finite verb and an unambiguously tagged object.[79] The frequency of direct objects (OBJ) for a given noun was approximated by the number of instances where the noun in question was unambiguously tagged as object. We here assume that an unambiguously tagged object implies an unambiguously tagged subject. However, by not explicitly demanding that the object is preceded by a subject, we also capture objects with a “missing” subject, such as relative clauses and infinitival clauses.

[79] The tagger works in an eliminative fashion, so tokens may bear two or more tags when they have not been fully disambiguated.

Genitive

Genitive marking typically signals a semantic relation of possession, a relation which has been shown to favour animate possessors (Rosenbach 2002; Dahl and Fraurud 1996). However, this requirement is certainly not an absolute constraint on the construction; semantic relationships figuring inanimate entities such as part-whole relations, e.g. bilens hjul ‘the car’s wheel’, also occur commonly.[80] The feature extraction for the genitive feature (GEN) counts the number of times each noun occurs with genitive case marking, i.e. the suffix -s.

Passive

Agentivity is also related to animacy, see section 3.5. Animate entities are inherently sentient, capable of acting volitionally and causing an event to take place – all properties of the prototypical agent (Dowty 1991). The passive construction, or rather the property of being expressed as the demoted agent in a passive construction, is a possible approximator of agentivity.
Transitive constructions tend to passivize better (hence more frequently) if the demoted subject bears a prominent thematic role, preferably agent. Norwegian has two ways of expressing the passive, a morphological passive (verb + s) and a periphrastic passive (bli + past participle). The counts for the passive feature (PASS) include both types of passives preceding the by-phrase containing the noun lemma in question.

[80] An alternative construction to the s-genitive in Norwegian is formed by inserting the possessive pronoun sin between the possessor and the possessed, as in mannen sin bil ‘the man’s car’. The sin-genitive is to be preferred when the relation is one of possession (Faarlund, Lie and Vannebo 1997), hence often involving an animate possessor. However, data on this construction was far too sparse and yielded zero occurrences for a large number of the nouns (both animate and inanimate), and the feature was hence abandoned.

Anaphoric reference

In section 3.6, we discussed the idea that animate entities tend to be more individuated and more cognitively accessible. An entity which is highly individuated and accessible is also more likely to be referred to again later on in discourse. Anaphoric reference is a phenomenon where the animacy of a referent is clearly expressed. The personal pronouns distinguish their antecedents along the animacy dimension – animate han/hun ‘he/she’ vs. inanimate den/det ‘it-MASC/NEUT’. This is one reason why information regarding the animacy of a noun can be helpful in the task of coreference resolution (Orăsan and Evans 2007). Coreference resolution is a complex problem, and certainly not one that we shall attempt to solve in the present context. However, we might attempt to come up with a metric that approximates the coreference relation in a manner adequate for our purposes.
Hale and Charniak (1998) describe a method for extracting gender statistics for English nouns by making use of coreference approximations. Their simplest method is the “last noun seen” method, where an anaphoric link is established between the last noun of one sentence and an initial pronoun in the next. This method is reported to account for approximately 43% of all anaphoric coreferences in a hand-tagged subset of the Wall Street Journal corpus. They also make use of the Hobbs algorithm (Hobbs 1976), which relies on a phrase-structure parse of the sentence in question as well as the preceding text, and exploits syntactic cues for coreference. This strategy alone is reported to achieve an accuracy of 65.3% on the same WSJ subset.

In our attempt to approximate coreference relations between a common noun and a subsequent personal pronoun, we make use of the fact that a personal pronoun usually refers to a discourse salient element which is fairly recent in the discourse. Now, if a sentence only contains one core argument (i.e. an intransitive subject) and it is followed by a sentence initiated by a personal pronoun, it seems reasonable to assume that these are likely to be coreferent (Hale and Charniak 1998). (64) below shows an authentic example from the results for the noun mann ‘man’ taken from the Oslo Corpus:

(64) Mannen_i ble pågrepet etter tre kvarters dramatisk biljakt. Han_i var beruset og satt med den ladde haglen over knærne.
     The man_i was apprehended after a three-quarter long car chase. He_i was intoxicated and sat with the loaded shot gun across his knees.

For each of the nouns in table 6.1, we count the number of times it occurs as a subject with no subsequent object and an immediately following sentence initiated by (i) an animate personal pronoun (ANAAN) – han ‘he’, hun ‘she’ or de ‘they’, and (ii) an inanimate personal pronoun (ANAIN) – den ‘it-MASC’ or det ‘it-NEUT’.
The 3rd person plural pronoun de ‘they’ is not a clear indicator of animacy since it may refer to both animate and inanimate referents, as in English. Merlo and Stevenson (2001) show that, in English, this plural pronoun exhibits a preference for animate reference and in a selection of 100 occurrences of this pronoun, they found that 76% of these had an animate antecedent.[81] We therefore make the same assumption for Norwegian. Although this is a possible source for mistakes in the counts, we assume that the general distribution of instances will still make the relevant distinction with regards to animacy. For the inanimate pronouns, the neuter form det ‘it-NEUT’ is problematic as this is also the expletive subject form, hence this pronoun often initiates a sentence, but has a clearly non-referential function. However, as there is no obvious way of automatically distinguishing between the pronominal and expletive use, we count all occurrences of this pronoun when it initiates a following sentence. Another possibility would have been to exclude all occurrences of det ‘it-NEUT’ from the counts, with the consequence that this test would be inapplicable for the set of neuter nouns in our training set (8 nouns).

[81] Merlo and Stevenson (2001) make use of personal pronouns as indicators of argument structure for a verb. If the verb often occurs with an animate pronominal subject, they assume that it assigns an agentive role to its subject.

           SUBJ    OBJ     GEN     PASS    ANAAN   ANAIN   REFL
forfatter  0.1734  0.0809  0.0639  0.0020  0.0109  0.0034  0.0075
artikkel   0.0799  0.1091  0.0032  0.0032  0.0013  0.0032  0.0006

Figure 4: Example feature vectors.

Reflexive

Reflexive pronouns represent another form of anaphoric reference which, contrary to the personal pronouns, locate their antecedent locally, i.e. within the same clause. The third person reflexive pronoun seg ‘him/her/itself’ does not, however, position its antecedent along the animacy dimension.
In the reflexive construction the subject and the reflexive object are, typically, coreferent and it describes an action directed at oneself. Although the reflexive pronoun in Norwegian does not distinguish for animacy, the agentive semantics of the construction might favour an animate subject.

The feature of reflexive coreference (REFL) is more straightforward to approximate, as this coreference takes place within the same clause. For each noun, the number of occurrences as a subject followed by a verb and the 3rd person reflexive pronoun seg ‘him-/her-/itself’ are counted.

6.2.3.1 Data overview

For classification, each noun is represented as a feature vector of distributional features and is labeled with its class – animate or inanimate. Figure 4 shows the individual feature vectors representing the animate noun forfatter ‘writer’ and the inanimate noun artikkel ‘article’.

         Animate          Inanimate
         Mean    SD       Mean    SD       #
SUBJ     0.14    0.05     0.07    0.03     16813
OBJ      0.11    0.03     0.23    0.10     24128
GEN      0.04    0.02     0.02    0.03     7830
PASS     0.006   0.005    0.002   0.002    577
ANAAN    0.009   0.006    0.003   0.002    989
ANAIN    0.003   0.003    0.006   0.003    944
REFL     0.005   0.0008   0.001   0.0008   558

Table 6.2: Mean relative frequencies and standard deviation for each class (20 animate, 20 inanimate nouns) and feature, as well as total data points for each feature (#).

The mean relative frequencies with standard deviations for each class – animate and inanimate – and feature are presented in table 6.2. The total data points for each feature following the data collection are also presented in the last column of table 6.2. As we can see, quite a few of the features express morphosyntactic cues that are rather infrequent. This is in particular true for the passive feature (PASS) and the anaphoric features ANAAN, ANAIN and REFL.
When examining the features in table 6.2, however, these features still express the relevant distinctions, and all differences between the means of the two groups are significant.[82]

Another point is that the values for the features that one would expect to be quite frequent, e.g. SUBJ and OBJ, only range from about 3% to 14% of all occurrences. The reason for this is that the regular expressions designed to extract the counts require the subjects and objects in question to be unambiguously tagged. This means that the transitive subjects and objects that are counted are only those that occur in a syntactic environment which clearly disambiguates them functionally.[83]

[82] Statistical significance was calculated with an unpaired t-test. We compared the mean of means between the group of animate and inanimate nouns, and found that all differences were significant – SUBJ, OBJ, REFL: p < .0001; ANAAN: p < .0005; PASS: p < .005; ANAIN: p < .01; GEN: p < .05.
[83] In practice this includes transitive complex VPs (due to the V2-property of Norwegian), i.e. VPs containing auxiliary or modal verbs, sentences where something other than the subject or object occupies sentence initial position, or subjects or objects appearing in subordinate clauses of different types, see section 4.2.

6.2.3.2 Other features

In addition to the features presented in table 6.2, several other features were extracted, which did not exhibit a significant distinction. This was partly due to errors in the automatic analysis. For instance, indirect objects in ditransitive constructions turned out to yield a result that was contrary to the expected results. The mean result for the animate class was 0.007%, whereas the inanimate class had the higher count of 0.008%. However, a quick look at some of the extracted sentences shows that the tagger’s analysis of indirect objects is inaccurate.
Other features that proved not to differ significantly for the two classes include morphological definiteness and the ‘last noun seen’ anaphoric reference approximation (Hale and Charniak 1998).

6.3 Method viability

We start out by testing the viability of the method as such, i.e. whether unseen nouns may be classified for animacy based on a small set of linguistically motivated distributional features. We also test the effect of the various features individually and in combination.

6.3.1 Experimental methodology

The experimental methodology chosen for the classification experiments is similar to the one described in Merlo and Stevenson 2001 for verb classification. We employ decision tree learning for construction of classifiers (see section 5.2.1) and leave-one-out training and testing of the classifiers. In leave-one-out cross validation, each noun is used as test data exactly once, whereby the n − 1 other instances are used for training the classifier. This is a good option when the set of training data is small, as in the present context. In addition, all our classifiers employ the boosting option for constructing classifiers (Quinlan 1993).[84] For calculation of the statistical significance of differences in the performance of classifiers tested on the same data set, McNemar’s test (Dietterich 1998) is employed. Note, however, that due to the small data set, the test provides a very strict criterion by which to determine difference. We therefore report results even though they are not statistically significant, but remark on significance explicitly wherever relevant.

The baseline we employ is a random baseline of 50% accuracy.

[84] In boosting, several classifiers are constructed during training and applied to each test instance, whereby the classification is determined by majority voting.
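The leave-one-out procedure can be sketched as below. This is a minimal illustration under stated assumptions, not the experimental setup: a one-feature threshold rule (a decision stump on SUBJ) stands in for the boosted decision trees, and only the SUBJ values for forfatter, artikkel and venn are taken from the text; the remaining rows and all function names are invented for the example.

```python
from typing import List, Tuple

Instance = Tuple[str, float, str]  # (lemma, SUBJ relative frequency, class)

# forfatter, artikkel and venn use SUBJ values reported in the text;
# the rows marked "invented" are hypothetical toy data.
DATA: List[Instance] = [
    ("forfatter", 0.1734, "Anim"),
    ("venn", 0.076, "Anim"),
    ("lege", 0.15, "Anim"),       # invented
    ("artikkel", 0.0799, "Inan"),
    ("bok", 0.06, "Inan"),        # invented
    ("pris", 0.05, "Inan"),       # invented
]


def predict(threshold: float, value: float) -> str:
    """Stump rule: a SUBJ proportion above the threshold is classified as animate."""
    return "Anim" if value > threshold else "Inan"


def train_stump(train: List[Instance]) -> float:
    """Choose the midpoint threshold with the fewest training errors (first one on ties)."""
    values = sorted(x for _, x, _ in train)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t: float) -> int:
        return sum(predict(t, x) != y for _, x, y in train)

    return min(candidates, key=errors)


def leave_one_out_accuracy(data: List[Instance]) -> float:
    """Each instance is the test item exactly once; the n - 1 others are the training set."""
    correct = 0
    for i, (_, x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        correct += predict(train_stump(train), x) == y
    return correct / len(data)


if __name__ == "__main__":
    print(f"leave-one-out accuracy: {leave_one_out_accuracy(DATA):.3f}")
```

With n nouns this trains n separate classifiers, which is affordable precisely because the data set is small.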
Feature  Accuracy (%)
SUBJ     85.0
REFL     82.5
OBJ      72.5
GEN      72.5
ANAAN    67.5
PASS     62.5
ANAIN    50.0

Table 6.3: Accuracy for classifiers trained with individual features.

    Used                                 Not Used  Accuracy (%)
1.  SUBJ OBJ GEN PASS ANAAN ANAIN REFL             87.5
2.  OBJ GEN PASS ANAAN ANAIN REFL        SUBJ      85.0
3.  SUBJ GEN PASS ANAAN ANAIN REFL       OBJ       87.5
4.  SUBJ OBJ PASS ANAAN ANAIN REFL       GEN       85.0
5.  SUBJ OBJ GEN ANAAN ANAIN REFL        PASS      82.5
6.  SUBJ OBJ GEN PASS ANAIN REFL         ANAAN     82.5
7.  SUBJ OBJ GEN PASS ANAAN REFL         ANAIN     87.5
8.  SUBJ OBJ GEN PASS ANAAN ANAIN        REFL      75.0

Table 6.4: Accuracy for classifiers trained with all features and ‘all minus one’.

6.3.2 Experiment 1

Table 6.3 shows the performance of each individual feature in the classification of animacy. As we can see, the performance of the features differs quite a bit, ranging from mere baseline performance (ANAIN) to a 70% error reduction compared to the baseline (SUBJ). The first line of table 6.4 shows the performance using all the seven features collectively where we achieve an accuracy of 87.5%, an error reduction of 75%. The SUBJ, REFL, OBJ and GEN features employed individually are the best performing individual features and their classification performance does not differ significantly from the performance of the combined classifier, whereas the rest of the individual features do (p<.05).

The subsequent lines (2–8) of table 6.4 show the accuracy results for classification using all features except one at a time. This provides an indication of the contribution of each feature to the classification task. In general, the removal of a feature causes a 0%–12.5% deterioration of results; however, only the difference in performance caused by the removal of the REFL feature is significant (p<.05). Since this feature is one of the best performing features individually, it is not surprising that its removal causes a notable difference in performance.
The removal of the ANAIN feature, on the other hand, does not have any effect on accuracy whatsoever. This feature was the poorest performing feature with a baseline, or mere chance, performance.

6.3.2.1 Discussion

The above experiments have shown that the classification of animacy for common nouns is achievable using morphosyntactic distributional data from a corpus. The results of the experiments are encouraging, and due to the fact that the features are linguistically motivated, hopefully also generalizable to a larger set of nouns. However, several questions remain unanswered following these initial experiments.

We have chosen to classify along a binary dimension (animate vs. inanimate) with a small set of nouns. Two related objections may be put forward at this point. Firstly, it might be argued that a binary dimension such as this is artificial and that there should be a finer subdivision of nouns. Zaenen et al. (2004) describe an encoding scheme for the manual encoding of animacy information in part of the English Switchboard corpus. They make a three-way distinction between human, other animates, and inanimates, and also provide further subdivisions of these. The ‘other animates’ category describes a rather heterogeneous group of entities: organizations, animals, intelligent machines and vehicles. What these have in common is that they may all be construed linguistically as animate beings, even though they, in the real world, are not. Interestingly, the two misclassified inanimate nouns in our experiments were bil ‘car’ and fly ‘airplane’, both vehicles.[85] They exhibited a more agentive pattern which showed up in the transitive subject feature, the passive feature and the reflexive feature, in particular. However, they did not pattern completely with the animate nouns: they had a high object count and behaved like the inanimate nouns when it came to anaphoric pronouns.
Secondly and related to the above, the choice of nouns in the experiment might be considered too limited. Had we chosen to include, for instance, nouns that have a metonymic use, e.g., organizations, the classification into only two classes might have been less successful. However, we chose to start out with a binary classification in order to test the viability of the method and its suitability for the classification task.

[85] Inanimate subjects have been claimed to be ungrammatical in Japanese. However, sentence production experiments have employed examples like A taxi picked up a traveler, which were deemed acceptable (Branigan, Pickering and Tanaka 2008).

The features represent linguistic dimensions which have been claimed to correlate with animacy, such as syntactic functions and thematic roles. One might ask whether the chosen features represent sufficient information to base classification on. One of the misclassified animate nouns was venn ‘friend’, a clearly animate noun. However, according to our seven chosen features, this noun largely patterns with the inanimate nouns. On reflection, this probably also makes sense, as we are basing our classification of a real world property only on our linguistic depiction of it. A friend is probably more like a physical object in the sense that it is someone one likes/hates/loves or otherwise reacts to, rather than being an agent that acts upon its surroundings. This is reflected in a low proportion of subject occurrences (0.076), as well as reflexive reference (0.0012) and a high proportion of direct object occurrences (0.19), see table 6.2.

In conclusion then, we have seen that the method yields promising results for classification of animacy when applied to Norwegian common nouns using a set of seven linguistically motivated features of animacy.
The features, which capture syntactic distributional properties of the nouns where animacy has been shown to cause frequency effects, proved important in classification.

6.4 Robustness

The classification experiments reported above impose a frequency constraint (absolute frequencies > 1000) on the nouns used for training and testing in order to study the interaction of the different features without the effects of sparse data. In the light of the results from these experiments, however, it might be interesting to further test the performance of our features in classification as the frequency constraint is gradually relaxed. To this end, three sets of common nouns each counting 40 nouns (20 animate and 20 inanimate nouns) were randomly selected from groups of nouns with approximately the same frequency in the corpus. The first set included nouns with an absolute frequency of 100±20 (∼100), the second of 50±5 (∼50) and the third of 10±2 (∼10). Feature extraction followed the same procedure as in experiment 1: relative frequencies for all seven features were computed and assembled into feature vectors, one for each noun.

6.4.1 Experiment 2: Effect of sparse data on classification

In order to establish how much of the generalizing power of the classifier is lost when the frequency threshold for the extraction of nouns is lowered, an experiment was conducted which tested the performance of the earlier classifier, i.e. the classifier trained on the more frequent nouns, as applied to the three groups of less frequent nouns.

Freq      All    SUBJ   OBJ    GEN    PASS   ANAAN  ANAIN  REFL
> 1000    87.5   85.0   72.5   72.5   62.5   67.5   50.0   82.5
∼100      70.0   75.0   80.0   72.5   65.0   52.5   50.0   60.0
∼50       57.5   75.0   62.5   77.5   62.5   57.5   50.0   55.0
∼10       52.5   52.5   65.0   50.0   57.5   50.0   50.0   50.0

Table 6.5: Accuracy obtained when applying the high-frequency classifiers trained with all and individual features to the lower-frequency nouns (∼100, ∼50, ∼10).
As we can see from the All column in table 6.5, we observe a clear deterioration of results, from our earlier accuracy of 87.5% to new accuracies ranging from 70% to 52.5%, the latter barely above the baseline. Not surprisingly, the results decline steadily as the absolute frequency of the classified noun is lowered.

Accuracy results provide an indication that the classification is problematic. However, they do not indicate what the damage is to each class as such. A confusion matrix is in this respect more informative. Confusion matrices for the classification of the three groups of nouns, ∼100, ∼50 and ∼10, are provided in table 6.6. These clearly indicate that it is the animate class which suffers when data becomes more sparse. The percentage of misclassified animate nouns increases drastically from 50% at ∼100 to 80% at ∼50 and finally 95% at ∼10. The classification of the inanimate class remains fairly stable throughout. The fact that a majority of our features (SUBJ, GEN, PASS, ANAAN and REFL) target animacy, in the sense that a higher proportion of animate than inanimate nouns exhibit the feature, gives a possible explanation for this. As data gets more limited, this distinction becomes harder to make, and the animate feature profiles come to increasingly resemble the inanimate. Because the inanimate nouns are expected to have low proportions (compared to the animate) for all these features, the data sparseness is not as damaging.

In order to examine the effect of the lowering of the frequency threshold on each individual feature, we also ran classifiers trained on the high frequency nouns with only individual features on the three groups of new nouns. These results are shown in the individual feature columns (SUBJ through REFL) of table 6.5. As the frequency threshold is lowered, the performance of the classifiers employing all features and that of the classifiers trained only on individual features become more similar.
For the ∼100 nouns, only the two anaphoric features, ANAAN and the reflexive feature REFL, have a performance that differs significantly (p<.05) from the classifier employing all features. For the ∼50 and ∼10 nouns, there are no significant differences between the classifiers employing individual features only and the classifiers trained on the feature set as a whole. This indicates that the combined classifiers no longer exhibit properties that are not predictable from the individual features alone and they do not generalize over the data based on the combinations of features.

∼100 nouns
  (a)   (b)    ← classified as
  10    10     (a) class animate
   2    18     (b) class inanimate

∼50 nouns
  (a)   (b)    ← classified as
   4    16     (a) class animate
   1    19     (b) class inanimate

∼10 nouns
  (a)   (b)    ← classified as
   1    19     (a) class animate
   0    20     (b) class inanimate

Table 6.6: Confusion matrices for classification of lower frequency nouns with the high-frequency classifier.

In terms of accuracy, a few of the individual features even outperform the collective result. On average, the three most frequent features, the SUBJ, OBJ and GEN features, cause a 9.5% and 24.6% reduction of the error rate for the ∼100 and ∼50 nouns, respectively. For the lowest frequency nouns (∼10) we see that the OBJ feature alone reduces the errors by almost 24%, from 52.5% to 65% accuracy. In fact, the OBJ feature seems to be the most stable of all the features. When examining the means of the results extracted for the different features, the OBJ feature is the one which maintains the largest difference between the two classes as the frequency threshold is lowered. The second most stable feature in this respect is the SUBJ feature. Figure 5 clearly illustrates the effect of sparse data on classification accuracy.
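The per-class figures quoted from table 6.6 can be recomputed directly from the confusion matrices. The following is a minimal sketch, not part of the original experimental setup; the function names are hypothetical, and the empty cell of the ∼10 matrix is assumed to be 0, which is the value implied by the 52.5% accuracy in table 6.5.

```python
# Confusion matrices from table 6.6. Rows are true classes and columns are
# predicted classes, both in the order (animate, inanimate).
# Assumption: the missing cell in the ~10 matrix is 0, as implied by the
# 52.5% accuracy reported in table 6.5.
matrices = {
    "~100": [[10, 10], [2, 18]],
    "~50": [[4, 16], [1, 19]],
    "~10": [[1, 19], [0, 20]],
}


def accuracy(matrix):
    """Proportion of all nouns on the diagonal, i.e. correctly classified."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


def miss_rate(matrix, cls):
    """Proportion of nouns of true class `cls` assigned to the other class."""
    row = matrix[cls]
    return (sum(row) - row[cls]) / sum(row)


for name, matrix in matrices.items():
    print(name, accuracy(matrix), miss_rate(matrix, 0), miss_rate(matrix, 1))
```

Index 0 stands for the animate class and index 1 for the inanimate class, so `miss_rate(matrix, 0)` gives the proportion of misclassified animate nouns discussed in the text.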
The group of experiments reported above shows that the lowering of the frequency threshold for the classified nouns causes a clear deterioration of results in general, and most gravely when all the features are employed together.

[Figure 5: Accuracy as a function of absolute noun frequencies for classifiers employing all features, as well as the individual SUBJ, OBJ and GEN classifiers.]

6.4.2 Experiment 3: Back-off features

The three most frequent features, the SUBJ, OBJ and GEN features, were the most stable in the two experiments reported above and had a performance which did not differ significantly from the combined classifiers throughout. In light of this we ran some experiments where all combinations of these more frequent features were employed. The results for each of the three groups of nouns are presented in table 6.7.

Freq     SUBJ&OBJ&GEN   SUBJ&OBJ   SUBJ&GEN   OBJ&GEN
∼100     87.5           87.5       77.5       85.0
∼50      82.5           90.0       70.0       77.5
∼10      57.5           50.0       50.0       47.5

Table 6.7: Accuracy obtained when applying classifiers trained with combinations of the most frequent features to the lower-frequency nouns.

The exclusion of the less frequent features has a clear positive effect on the accuracy results. For the ∼100 and ∼50 nouns, the performance has improved compared to the classifier trained with the full set of features, as well as the classifiers trained with individual features. The classification performance for these nouns is now identical to or only slightly worse than the performance for the high-frequency nouns in experiment 1. For the ∼10 group of nouns, the performance is, at best, the same as for all the features and at worst fluctuating around baseline.

In general, the best performing feature combinations are SUBJ&OBJ&GEN and SUBJ&OBJ.
These two differ significantly (p<.05) from the results obtained by employing all the features collectively for both the ∼100 and the ∼50 nouns, and hence indicate a clear improvement. The feature combinations both contain the two most stable features, one which targets the animate class (SUBJ) and another which targets the inanimate class (OBJ), a property which facilitates distinction even as the general differences between the two decrease. Figure 6 illustrates the clear improvement of feature back-off compared to the full set of features.

[Figure 6: Accuracy as a function of absolute noun frequencies for classifiers employing all features, as well as the backed off SUBJ&OBJ&GEN and SUBJ&OBJ classifiers.]

It seems, then, that backing off to the most frequent features might constitute a partial remedy for the problems induced by data sparseness in the classification. The feature combinations SUBJ&OBJ&GEN and SUBJ&OBJ both significantly improve the classification performance and enable us to maintain the same accuracy for the ∼100 and ∼50 nouns as for the higher frequency nouns, reported in experiment 1.

6.4.3 Experiment 4: Back-off classifiers

Another option, besides a back-off to more frequent features in classification, is to back off to another classifier, i.e. a classifier trained on nouns with a similar frequency. An approach of this kind attempts to exploit any group similarities that these nouns may have in contrast to the more frequent ones.

In this set of experiments, classifiers were trained and tested using leave-one-out cross-validation on the three groups of lower frequency nouns, employing individual features as well as various feature combinations. The results for all features as well as individual features are summarized in table 6.8.
As we can see, the result for the classifier employing all the features has improved somewhat compared to the corresponding classifiers in experiment 2 (as reported above in table 6.5) for all our three groups of nouns. This indicates that there is a certain group similarity for nouns of similar frequency that is captured in the combination of the seven features. However, backing off to a classifier trained on nouns that are more similar frequency-wise does not cause a significant improvement in classification accuracy. Apart from the SUBJ feature for the ∼100 nouns, none of the other classifiers trained on individual or all features for the three different groups differ significantly (p<.05) from their counterparts in experiment 2.

Freq      All    SUBJ   OBJ    GEN    PASS   ANAAN  ANAIN  REFL
> 1000    87.5   85.0   72.5   72.5   62.5   67.5   50.0   82.5
∼100      85.0   52.5   87.5   65.0   70.0   50.0   57.5   50.0
∼50       77.5   77.5   75.0   75.0   50.0   50.0   50.0   50.0
∼10       52.5   50.0   62.5   50.0   50.0   50.0   50.0   50.0

Table 6.8: Accuracy obtained when applying lower-frequency classifiers trained with all and individual features to new lower-frequency nouns. Performance with the high-frequency classifier (> 1000) is provided for comparison.

Freq     SUBJ&OBJ&GEN   SUBJ&OBJ   SUBJ&GEN   OBJ&GEN
∼100     85.0           85.0       67.5       82.5
∼50      75.0           80.0       75.0       70.0
∼10      62.5           62.5       50.0       62.5

Table 6.9: Accuracy obtained when applying lower-frequency classifiers trained with combinations of the most frequent features to new lower-frequency nouns.

As before, combinations of the most frequent features were employed in the new classifiers trained and tested on each of the three frequency-sorted groups of nouns. In the terminology employed above, this amounts to backing off both classifier- and feature-wise. The accuracy measures obtained for these experiments are summarized in table 6.9.
For these classifiers, the backed off feature combinations do not differ significantly from their counterparts in experiment 3, where the classifiers were trained on the more frequent nouns with feature back-off.

6.4.4 Summary

Experiments 1–4 have shown that the classification of animacy for Norwegian common nouns is achievable using distributional data from a morphosyntactically annotated corpus. The chosen morphosyntactic features of animacy have proven to distinguish well between the two classes. As we have seen, the transitive subject, direct object and morphological genitive provide stable features for animacy even when the data is sparse(r). Four groups of experiments have been reported which indicate that a reasonable remedy for sparse data in animacy classification consists of backing off to a smaller feature set in classification. These experiments indicate that a classifier trained on a small set of highly frequent nouns (experiment 1) and backed off to the most frequent features (experiment 3) sufficiently captures generalizations which pertain to nouns with absolute frequencies down to approximately fifty occurrences, and enables an unchanged performance approaching 90% accuracy.

6.5 Machine learning algorithm

Decision-tree learning represents an eager machine learning algorithm; generalization over the training data is constructed in the form of a decision tree prior to the observation of unseen test instances. In the following we investigate whether the animacy classification task generalizes to a lazy machine learning algorithm: Memory-Based Learning. See section 5.2 for more on machine learning and the eager-lazy distinction.

Memory-based learning, a class of instance-based learning algorithms, has been applied successfully to a range of NLP tasks, such as named-entity recognition (Tjong Kim Sang 2002a), parsing (Kübler 2004) and semantic role labeling (Morante and Busser 2007).
In the following we experiment with the application of memory-based learning (MBL) to the animacy data described above. We compare the performance of the MBL classifiers to the corresponding decision-tree classifiers and experiment with feature weighting for classification of lower frequency nouns.

6.5.1 Experimental methodology

All experiments make use of the TiMBL software package for Memory-Based Learning (Daelemans et al. 2004), see section 5.2.2 for more detail. As before, we employ leave-one-out training and testing, as well as McNemar’s test for statistical significance.

6.5.2 Experiment 5: High frequency nouns

The first set of experiments using TiMBL was conducted on the set of 40 high frequency nouns presented in table 6.1 which have an absolute frequency in the corpus of a thousand occurrences or more. We also experiment with different parameter settings available in TiMBL, as well as combinations of these, in order to locate the optimal setting for classification.

Classification in Memory-Based Learning is performed by comparison of new instances to the set of training instances. The determination of relevant examples, the k-nearest neighbors, is therefore an important component of learning. In this respect, we may experiment with different sized neighborhoods. We may also vary the influence of the neighbors on classification, which is either performed by majority voting or a weighted voting, where closer neighbors are given more weight in determining the class of a new instance. With inverse linear scaling, the weights of neighbor instances in classification are scaled linearly to reflect distance in vector space.[86] In experiments 1-4 we examined the influence of the various features in classification. TiMBL supports feature weighting, where features are given differing weights during classification, or rather, in determining the k-nearest neighbors for a given instance.
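The classification step described above can be illustrated with a small sketch. This is not TiMBL's implementation: the distance is a plain weighted Manhattan distance, k counts individual neighbors, and the inverse-linear weights follow the usual formulation in which the nearest neighbor receives weight 1 and the k-th neighbor weight 0. The function names and the two-feature toy vectors are hypothetical and serve only to make the example runnable.

```python
from collections import defaultdict


def distance(a, b, weights=None):
    """Weighted Manhattan distance between two numeric feature vectors."""
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def knn_classify(instance, training, k=1, voting="majority", weights=None):
    """Classify `instance` from its k nearest (vector, label) training pairs."""
    ranked = sorted((distance(instance, vec, weights), label)
                    for vec, label in training)
    neighbors = ranked[:k]
    d_near, d_far = neighbors[0][0], neighbors[-1][0]
    votes = defaultdict(float)
    for d, label in neighbors:
        if voting == "majority" or d_far == d_near:
            votes[label] += 1.0  # every neighbor counts equally
        else:
            # inverse-linear voting: closer neighbors weigh more
            votes[label] += (d_far - d) / (d_far - d_near)
    return max(votes, key=votes.get)


def leave_one_out_accuracy(data, **kwargs):
    """Classify each instance using all remaining instances as training data."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        correct += knn_classify(vec, rest, **kwargs) == label
    return correct / len(data)


# Hypothetical toy vectors: (subject proportion, object proportion).
toy = [
    ([0.15, 0.10], "animate"), ([0.13, 0.12], "animate"),
    ([0.16, 0.09], "animate"), ([0.06, 0.25], "inanimate"),
    ([0.08, 0.22], "inanimate"), ([0.07, 0.20], "inanimate"),
]
print(leave_one_out_accuracy(toy, k=1))
print(leave_one_out_accuracy(toy, k=3, voting="inverse_linear"))
```

Passing a `weights` list scales each feature's contribution to the distance, which is the role feature weighting plays when the nearest neighbors are determined.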
During parameter optimization we experiment with the information-based feature weighting schemes Information Gain and Gain Ratio. These approximate the information contained in a feature during training and weight accordingly at classification. We perform a set of experiments where we test all possible variations over the following parameters:

Feature weighting: No feature weighting (NO), Information Gain (InfoGain), Gain Ratio
Nearest neighbors: k = {1, 5, 7, 17}
Class voting: Majority, Inverse Linear

The various parameter settings do not have a great effect on the results, and, in fact, the best accuracy of 95% is achieved using no feature weighting, k = 1 and the default, majority class voting. This indicates that all the features contribute positively to classification of the high frequency nouns. This corroborates the findings in the decision-tree experiments for this group of nouns, where the classifier employing the total set of features was the best performing. Varying the k parameter increases the set of nearest neighbors, but does not cause an improvement of results. This is explained by the fact that the training and testing examples are so few (only 40), so enlarging the neighborhood only serves to introduce errors.

[86] The simpler, inverse linear scaling outperforms other class voting variants (inverse distance weight and exponential decreasing weights) in a majority of cases (Daelemans et al. 2004: 25).

Nouns    NO     InfoGain   GainRatio   ChiSquare   SharedVar
∼100     80.0   80.0       80.0        80.0        80.0
∼50      77.5   75.0       72.5        77.5        77.5
∼10      50.0   50.0       50.0        50.0        50.0

Table 6.10: Accuracy for classifiers trained with all features on the high frequency nouns only and tested on lower frequency nouns, employing different sorts of feature weighting: no weighting (NO), Information Gain, Gain Ratio, Chi Square and Shared Variance.
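The parameter sweep described above amounts to a full grid over three dimensions. As a small illustration, the settings can be enumerated as follows; the labels are plain strings chosen for readability here and are not TiMBL command-line options.

```python
from itertools import product

# Parameter values as listed in the text; the string labels are illustrative.
feature_weighting = ["NO", "InfoGain", "GainRatio"]
nearest_neighbors = [1, 5, 7, 17]
class_voting = ["Majority", "InverseLinear"]

# Every combination of weighting scheme, neighborhood size and voting scheme.
settings = list(product(feature_weighting, nearest_neighbors, class_voting))
print(len(settings))
for weighting, k, voting in settings[:3]:
    print(weighting, k, voting)
```

With three weighting options, four values of k and two voting schemes, the grid contains 3 × 4 × 2 = 24 settings.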
None of the results for the different experiments differ significantly from the result obtained using a decision-tree classifier on this group of nouns.

6.5.3 Experiment 6: Lower frequency nouns

In section 6.4 we observed that the performance of the decision-tree classifiers deteriorated notably when moving from high frequency to lower frequency nouns, ranging from 70.0% to a near-baseline accuracy of 52.5% for the set of lowest frequency nouns (∼10). Furthermore we found that by employing only a subset of the initial features, we were able to maintain a performance similar to that of the high frequency nouns also for the ∼100 and ∼50 nouns. This set of experiments examines the performance of classifiers trained using memory-based learning on the same data sets. We will in particular look at possibilities for replacing the feature back-off by various schemes for feature weighting.

6.5.3.1 No feature weighting

The first column in table 6.10 shows the results for the three groups of nouns when employing the best feature setting from the first group of experiments (no feature weighting, k = 1, majority voting). As expected, these results are lower than the performance obtained when classifying the high frequency nouns (95%). The accuracy for the ∼100 nouns (80%) is better, though not significantly so, than the corresponding result employing a decision-tree classifier (70%). The result for the ∼50 nouns (77.5%), however, is significantly better than the corresponding decision-tree result (57.5%), and differs only slightly from the result for the ∼100 nouns. This is an interesting result because it indicates that MBL deals better with sparse data in this case.

A problem which was identified above is that it is the animate class in particular that suffers as data gets sparser.
A majority of the features (SUBJ, GEN, PASS, ANAAN and REFL) target animacy in the sense that a higher proportion of animate than inanimate nouns exhibit the feature. As data gets more limited, the animate feature profiles become increasingly similar to the inanimate profiles, and the animate nouns are hence frequently misclassified. If we examine the MBL results for the ∼50 nouns we find that the percentage of misclassified animate nouns has dropped dramatically compared to the decision-tree result, from 80% to 35%. A closer look at the most similar data point, i.e. the nearest neighbor, for each of the animate instances that were misclassified by the decision-tree classifier, but correctly classified by MBL, reveals that two instances are recurrent as singleton nearest neighbor in a clear majority of these cases (78%). These two instances are not, however, a nearest neighbor to any of the high frequency nouns from Experiment 1, and are in that respect outliers.

The above observations are very much in line with those made in Daelemans, van den Bosch and Zavrel 1999, who argue against the editing away of exceptions. In their terminology, the two examples above would have a class prediction strength of zero for the high frequency nouns, making them very bad class predictors and candidates for editing.[87] However, these are exactly the examples responsible for the significant improvement of results for the animate ∼50 nouns. This highlights an important difference between decision tree learning and memory-based learning which resides in the fact that the former employs an eager machine learning algorithm, whereas the latter employs a lazy one. It is inherent in decision-trees that they represent some sort of generalization over the data. The C5.0 algorithm (Quinlan 1993) abstracts away from exceptions through pruning of the tree and always prefers smaller trees to larger ones.
It is then highly likely that the properties of the two exceptions were pruned away in the decision-tree approach, leaving more of the ∼50 nouns for misclassification. MBL, on the other hand, conserves all examples and does not in this sense generalize over the data. This property proved to be beneficial in the classification of the lower frequency nouns and points to a possible advantage of lazy learning.

[87] The class prediction strength of an instance is the ratio of the number of times the instance is a nearest neighbor of another instance with the same class to the number of times that the instance is the nearest neighbor of another instance regardless of the class (Daelemans, van den Bosch and Zavrel 1999: 12).

SUBJ   OBJ    GEN    ANAAN  ANAIN  PASS   REFL
0.32   0.47   0.15   0.02   0.02   0.01   0.01

Table 6.11: Feature weights representing relative frequency in the training set of high frequency nouns.

6.5.3.2 TiMBL’s feature weighting

TiMBL offers a range of different feature weighting schemes and the remaining columns in table 6.10 show the results from employing different feature weighting settings to the three different groups of nouns. As mentioned, the InfoGain and GainRatio settings are entropy-based measures (see section 5.2.1), whereas ChiSquare and SharedVar(iance) employ the χ²-test of statistical significance to compute differences in the distributions of features in the training data.[88] These do not have any significant effect, and in a majority of cases actually have no effect at all. This is not so surprising, as the feature weights are calculated based on the training data, the high frequency nouns in this case. The measures are based on the informativity of the feature in this data set, and quite correctly point out the REFL feature as one of the most informative features.
However, the exclusion of features in the earlier decision-tree experiments was not done on the basis of informativity, but rather on the basis of frequency, under the assumption that features more frequent in the data also provide more stable class predictors. In fact, the REFL feature is one of the rarest features in the feature set and does not hold up well against sparse data.

[88] With numeric features, as in the present study, the feature values are discretized prior to application of the ChiSquare and SharedVar weighting schemes. SharedVar extends ChiSquare by correcting for degrees of freedom (Daelemans et al. 2004: 22).

6.5.3.3 Frequency-based feature weighting

Based on the above considerations, we formulate an alternative, frequency-based feature weighting scheme. Here, we employ the conditional relative frequency of a feature f – the number of data points covered by a feature relative to all data points in the training data – as weights:

\[ \mathrm{freqweight}(f) = \frac{\sum_{i} \mathrm{freq}(f, w_i)}{\sum_{i} \sum_{j} \mathrm{freq}(f_i, w_j)} \]

The weights for each of the features are presented in table 6.11. As in the decision-tree experiments, the OBJ, SUBJ and GEN features are the top three most frequent, and hence receive the highest weights. The results from these experiments are shown in the first column of table 6.12. The feature weighting results in an improvement of the accuracy for the ∼100 and ∼10 nouns; however, this does not constitute a significant improvement compared to the results with no feature weighting (as reported in table 6.10).

Nouns    FreqWeights   SUBJ&OBJ&GEN   SUBJ&OBJ
∼100     90.0          95.0           90.0
∼50      77.5          75.0           87.5
∼10      67.5          67.5           62.5

Table 6.12: Accuracy obtained when applying classifiers with frequency-based feature weighting or combinations of the most frequent features to lower frequency nouns.

Finally, experiments ignoring all features except the top three (SUBJ, OBJ, GEN) and top two (SUBJ, OBJ) most frequent features were performed.
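The frequency-based weights of section 6.5.3.3 can be sketched in a few lines. The per-feature counts used here are an assumption: they are obtained by subtracting the organization-only data points from the three-class totals reported in section 6.6.1 (table 6.13), on the reading that the remainder corresponds to the 40 high frequency animate and inanimate nouns. The variable names are hypothetical.

```python
# Total data points per feature for all three classes (table 6.13, column #).
all_classes = {"SUBJ": 31537, "OBJ": 28046, "GEN": 16419, "PASS": 1203,
               "ANAAN": 1047, "ANAIN": 1253, "REFL": 840}
# Data points for the organization nouns only (section 6.6.1).
org_only = {"SUBJ": 14724, "OBJ": 3918, "GEN": 8589, "PASS": 626,
            "ANAAN": 58, "ANAIN": 309, "REFL": 282}

# Assumption: the difference is the count for the high frequency
# animate and inanimate nouns used as training data.
high_freq = {f: all_classes[f] - org_only[f] for f in all_classes}

# freqweight(f): data points covered by f relative to all data points.
total = sum(high_freq.values())
freqweight = {f: n / total for f, n in high_freq.items()}

for feature, weight in sorted(freqweight.items(), key=lambda x: -x[1]):
    print(f"{feature:6s} {weight:.2f}")
```

Under this assumption the rounded weights coincide with the values listed in table 6.11, and by construction they sum to one.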
These results are reported in the remaining columns in table 6.12. Only the result for the ∼100 nouns (95%) employing the three most frequent features improves significantly on the result obtained with no feature weighting. However, none of these results differ significantly from the ones obtained in the corresponding decision-tree experiments.

6.5.4 Summary

The aim of this section was dual. First, we applied Memory-Based Learning to the classification of animacy for Norwegian common nouns. Second, these results were compared with corresponding results achieved when applying decision trees to the same task in earlier experiments. For the set of highly frequent nouns, we achieved a classification accuracy of 95%. The different parameter settings in TiMBL did not affect the results notably and the best accuracy was achieved employing the most “basic” settings: no feature weighting, k = 1, and majority class voting. When this classifier was applied to the three groups of lower frequency nouns, the results deteriorated somewhat, to 80% for the ∼100 nouns and 77.5% for the ∼50 nouns, and quite drastically, to a mere baseline performance, for the ∼10 nouns. Based on the previous results from the decision-tree classification, the results for the lower frequency nouns were to be expected.

In general, the performance of the Memory-Based learner did not differ significantly from that of the decision-tree classifiers, hence it is difficult to draw firm conclusions with respect to superiority of one machine learning algorithm over another. As mentioned already, the lack of significance in differences may be partially due to the size of the data set, a question to which we shall return in chapter 7. We may conclude, however, that the results generalize across machine-learning algorithms.
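The significance comparisons in this chapter rely on McNemar's test, which considers only the items on which two classifiers disagree. The sketch below uses the exact (binomial) variant; the text does not state which variant was used, so this choice, the function name and the example counts are assumptions made purely for illustration.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items classifier A labels correctly and classifier B incorrectly
    c: items classifier B labels correctly and classifier A incorrectly
    """
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree
    k = min(b, c)
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for two classifiers tested on 40 nouns.
print(mcnemar_exact(9, 1))
print(mcnemar_exact(5, 3))
```

With only 40 test nouns the number of discordant items is necessarily small, which is one reason why sizeable accuracy differences may fail to reach significance.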
6.6 Class granularity: classifying organizations

The earlier experiments have shown a binary classification of animacy to be worthwhile, with best accuracies approaching 95%. Zaenen et al. (2004) propose that nouns denoting organizations inhabit an intermediate position with respect to the two other main categories in an animacy hierarchy: animate and inanimate.[89] This section describes the automatic classification of organization nouns along the animacy dimension already established in the previous sections. In doing so, we assess distributional evidence for such a distinction and examine how a more fine-grained notion of animacy affects our earlier results in classification. Under the assumption that the set of features outlined above serves to capture important aspects of the property of animacy, we may in turn examine their generalization to a new set of nouns, namely nouns which denote organizations.

[89] See section 3.2 for more on gradience in the animacy dimension.

6.6.1 Data

The set of organization nouns employed in this study was collected while annotating sentences for a corpus study of animacy in Norwegian (Øvrelid 2004). The nouns consist primarily of collective nouns or nouns that have a regular metonymic usage where they are employed to refer to organizations, see section 3.2.4. Following Garretson et al. (2004), we annotate as organizations those nouns which denote collectivities of humans which display group identity. The implicational hierarchy in (65) illustrates the distinction between the human class and the class of organizations.

(65) Implicational hierarchy distinguishing humans and organizations (Garretson et al. 2004):
     chartered/official > temporally stable > collective action, voice or purpose > collective

The hierarchy in (65) states that anything that is chartered or official is also temporally stable, has a collective voice etc., but not vice versa. The cut-off point between human and organization is here set at being temporally stable. Both organizations and groups of humans (e.g. a crowd or a mob) can be collective and have a collective action, voice or purpose; however, only organizations are in addition temporally stable and possibly also chartered or official.

         Animate           Inanimate         Organizations
         Mean     SD       Mean     SD       Mean     SD        #
SUBJ     0.14     0.05     0.07     0.03     0.20     0.10      31537
OBJ      0.11     0.03     0.23     0.10     0.06     0.03      28046
GEN      0.04     0.02     0.02     0.03     0.12     0.06      16419
PASS     0.006    0.005    0.002    0.002    0.012    0.015     1203
ANAAN    0.009    0.006    0.003    0.002    0.0009   0.001     1047
ANAIN    0.003    0.003    0.006    0.003    0.005    0.003     1253
REFL     0.005    0.0008   0.001    0.0008   0.004    0.0017    840

Table 6.13: Mean relative frequencies and standard deviations for each class (20 animate, 20 inanimate, 20 organization nouns) and feature, as well as total data points for each feature (#).

In order to control for frequency effects, only the organization nouns that occurred more than 800 times in the corpus were included. This is close to the restriction imposed on the animate and inanimate nouns, and hence makes the groups directly comparable. As before, we ensure a uniform distribution in the training data and employ 20 organization nouns in the study. These nouns are presented in (66):

(66) administrasjon ‘administration’, bank ‘bank’, bedrift ‘company’, bystyre ‘city council’, departement ‘ministry’, forening ‘association’, fylkeskommune ‘county’, komité ‘committee’, kommisjon ‘commission’, kommune ‘municipality’, kommunestyre ‘municipality board’, lag ‘team’, myndighet ‘authority’, organisasjon ‘organization’, parti ‘party’, regjering ‘government’, byrett ‘city court’, stat ‘state’, styre ‘board’, utvalg ‘committee’

Feature extraction for these nouns is performed in the same manner as for the animate and inanimate nouns.
The mean relative frequencies obtained for each of these features are presented in the Organizations columns of table 6.13, where we also provide total data points for the data set consisting of all three classes. The total data points covered by each feature for the organization nouns only are as follows: SUBJ: 14724, OBJ: 3918, GEN: 8589, PASS: 626, ANAAN: 58, ANAIN: 309, REFL: 282.

It is worth noting that the relative frequencies of the various features differ from the animate and inanimate classes in several ways. We see that the proportion of subjects and genitives is notably high for these nouns, significantly higher, in fact, than the frequencies observed for the animate nouns.[90] We also find that there is quite a bit of variation, as represented by the standard deviation. With respect to anaphoric reference, the organization nouns on average differ most markedly from the animate class, and are in this respect more similar to the inanimate nouns.

Feature   #Org_anim / #Org_all
ALL       1.00
SUBJ      0.95
OBJ       1.00
GEN       1.00
PASS      0.85
ANAAN     0.00
ANAIN     0.45
REFL      0.75

Table 6.14: Proportions of organization nouns classified as animate when classifying along the binary animacy dimension, employing all and individual features.

6.6.2 Experiment 7: Granularity

We now proceed to investigate further properties of the organization nouns by examining a three-way classification task, based on the same feature set as earlier. The experimental methodology is identical to the one employed in section 6.5.2 above, employing an MBL learner with leave-one-out cross validation.[91]

6.6.2.1 Animate or inanimate?

The first experiment involved testing the classifier trained only on the binary classified nouns on the new set of organization nouns. The main point of this experiment was to study how the classifier deals with these new nouns, and whether they are classified as being animate or inanimate in any systematic way.
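A rough preview of this question can be obtained from the class means in table 6.13 alone, by checking for each feature whether the organization mean lies closer to the animate or to the inanimate mean. This is only an illustrative sketch based on the means; it is not the memory-based classifier used in the experiment, and the function name is hypothetical.

```python
# Mean relative frequencies from table 6.13:
# feature -> (animate mean, inanimate mean, organization mean)
means = {
    "SUBJ": (0.14, 0.07, 0.20),
    "OBJ": (0.11, 0.23, 0.06),
    "GEN": (0.04, 0.02, 0.12),
    "PASS": (0.006, 0.002, 0.012),
    "ANAAN": (0.009, 0.003, 0.0009),
    "ANAIN": (0.003, 0.006, 0.005),
    "REFL": (0.005, 0.001, 0.004),
}


def closer_class(animate, inanimate, organization):
    """Return the class whose mean is nearer to the organization mean."""
    if abs(organization - animate) < abs(organization - inanimate):
        return "animate"
    return "inanimate"


preview = {f: closer_class(*values) for f, values in means.items()}
for feature, label in preview.items():
    print(feature, label)
```

On these means, the organization nouns pattern with the animate class for every feature except the two anaphoric ones (ANAAN and ANAIN), which is consistent with the observation above about anaphoric reference.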
Table 6.14 shows the proportion of organization nouns classified as animate by the binary classifier from the earlier experiments. The results indicate that the organization nouns exhibit overall distributional properties which are more similar to the animate nouns than to the inanimate nouns. All of the organization nouns were classified as animate when all seven features were employed. By varying the features used during classification we obtain a clearer picture of where the animate characteristics of the organization nouns surface. When employed individually, the OBJ and GEN features classify all the organization nouns (100%) as being animate. The subject (95.0%) and passive (85.0%) features are also strong indicators of animateness for the organization nouns. Another case worth noticing is the feature ANAAN, which classifies all the organization nouns as inanimate (hence 0.0% as animate). All of these results corroborate the proportional relationships observed in the relative frequencies for each feature. Just like the animate nouns, organizations have few object occurrences and a higher proportion of genitive forms, compared to the inanimate group of nouns. The results should not, however, be compared to the performance of the features as predictors of the classes animate and inanimate. The performance of the features in this context is neither good nor poor; it simply indicates the degree of animacy these nouns exhibit in the various morphosyntactic constructions covered by the different features.

90 As in section 6.2.3, statistical significance was calculated with an unpaired t-test.
91 For all experiments with more than one feature we employed the most basic settings (k = 1 and no feature weighting), following the parameter optimization in section 6.5.2. For the experiments where we test only one feature at a time, we increased the number of nearest neighbors to three (k = 3) in order to achieve somewhat more informed similarities and control for the influence of outliers in the classification space.

6.6.2.2 Three-way classification

The earlier classification experiment showed that the organization nouns had more in common with animate than inanimate nouns when it came to the morphosyntactic distributional properties measured by our seven features. Whereas the earlier experiment tested which of the two other classes the organization nouns most resemble, this experiment tests whether they are different enough to enable a three-way classification. We investigate whether the organization nouns might be better captured by an intermediate category ‘organization’. This would indicate that these nouns constitute a natural group, which shares a set of properties disjoint from those of the animate and inanimate nouns.

In order to test the validity of a new category, a new data set was constructed by concatenating the data of high-frequency animate and inanimate nouns with the data set consisting of organization nouns. It thus contains a uniform distribution of classes, i.e. an equal number of cases from each of the three categories: animate, organization and inanimate. This new data set consists of distributional data, as summarized in table 6.13, for 60 nouns, 20 from each class. A classifier was constructed and evaluated by means of leave-one-out cross validation. Since this is a three-way classification task, we assume a random baseline of 33.3%. The results are summarized in tables 6.15-6.16.

Feature   Accuracy (%)
SUBJ      75.0
OBJ       68.3
GEN       71.7
PASS      43.3
ANAAN     50.0
ANAIN     26.7
REFL      36.7

Table 6.15: Accuracy for classifiers with individual features in 3-way animacy classification.

     Used                                  Not used   Accuracy (%)
1.   SUBJ OBJ GEN PASS ANAAN ANAIN REFL               88.3
2.   OBJ GEN PASS ANAAN ANAIN REFL         SUBJ       86.7
3.   SUBJ GEN PASS ANAAN ANAIN REFL        OBJ        81.7
4.   SUBJ OBJ PASS ANAAN ANAIN REFL        GEN        85.0
5.   SUBJ OBJ GEN ANAAN ANAIN REFL         PASS       90.0
6.   SUBJ OBJ GEN PASS ANAIN REFL          ANAAN      81.7
7.   SUBJ OBJ GEN PASS ANAAN REFL          ANAIN      78.3
8.   SUBJ OBJ GEN PASS ANAAN ANAIN         REFL       80.0

Table 6.16: Accuracy for classifiers with all features and ‘all minus one’ in 3-way animacy classification.

The classifier constructed by means of all seven features receives an accuracy of 88.3%, which constitutes a clear improvement in comparison with the 33.3% baseline. When it comes to the individual performance of the different features, shown in table 6.15, we see that the best performing features are the subject (75.0%), genitive (71.7%) and object (68.3%) features. As in the earlier experiments, these features stand out with respect to the others as stable class predictors.

The results indicate that organization nouns are distributionally different when it comes to their realization as subject, object and/or genitive modifier. We earlier remarked on the fact that the organization nouns were most like the animate nouns with respect to these features in particular. How then is it possible that these also help distinguish them from the animate class? The corpus data show that the organization nouns have higher proportions of subjects and genitives, and lower proportions of objects than the animate nouns. Rather than indicating an intermediate animacy status, this in a sense makes them more animate than the animate nouns themselves. The next section examines the distribution of organization nouns over these three features in a bit more detail.

6.6.3 The distribution of organizations

As mentioned earlier, organizations have a distribution which sets them apart from the strictly animate or inanimate nouns.
The fact that these have been argued to occupy a middle ground with respect to animacy indicates that they have properties in common with both animate and inanimate nouns, see section 2.4 on gradience. As we shall see, this dual status is clearly reflected in their linguistic behaviour with regard to the chosen set of features.

6.6.3.1 Possessive relations

Let us start by examining more closely the linguistic behaviour captured in the genitive feature and how the three classes differ. The organization nouns have a significantly higher proportion of genitive case marking than the animate class (GEN: p<.0001). It has been claimed that animate and inanimate nouns differ in the types of possessive relations expressed by the genitive construction (Rosenbach 2003). Animate nouns prototypically express ownership, e.g., guttens bil ‘the boy’s car’, body parts, e.g., guttens arm ‘the boy’s arm’ and kinship terms, e.g., guttens far ‘the boy’s father’. Less prototypical relations for animate nouns are states (guttens tilstand ‘the boy’s condition’) and abstract possession (guttens liv ‘the boy’s life’). Inanimate nouns have a more limited range of possessive relations available, of which the part-whole relation (husets tak ‘the house’s roof’) is the most prototypical. A hypothesis which would support the intermediate status of organizations is that they have available the full range of possessive relations for both animate and inanimate expressions, and as a consequence are more frequent in the genitive.92 In order to get a more detailed picture of the genitive relations the organization nouns occur in, we performed a corpus study of 50 genitive occurrences for three nouns from each of the three classes, altogether 450 genitive instances.

92 The purpose of Rosenbach’s study is to investigate the genitive alternation in English. She excludes from her corpus study all kinds of collective nouns, thereby creating a strictly binary animacy opposition.
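The significance claim for the genitive feature can be roughly retraced from the summary statistics in table 6.13. The sketch below is an approximation under stated assumptions: it uses the rounded means and standard deviations from the table rather than the per-noun values the thesis worked with, and it assumes the pooled-variance form of the unpaired t-test (with 20 nouns per class, the unequal-variance form gives the same statistic). The function name `unpaired_t` is hypothetical.

```python
import math


def unpaired_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's unpaired t statistic with pooled variance.

    Returns the t value and the degrees of freedom (n_a + n_b - 2).
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# GEN row of table 6.13: organization nouns vs. animate nouns,
# 20 nouns in each class (rounded values as printed in the table).
t_gen, df_gen = unpaired_t(0.12, 0.06, 20, 0.04, 0.02, 20)
print(f"GEN: t = {t_gen:.2f}, df = {df_gen}")
```

With these inputs the statistic comes out at roughly 5.66 on 38 degrees of freedom, which is consistent with the reported p<.0001. The exact value in the thesis may differ, since the table figures are rounded.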
The nine nouns sampled for the genitive study are listed below:

(67) gutt ‘boy’, kvinne ‘woman’, president ‘president’
(68) bank ‘bank’, bedrift ‘company’, kommisjon ‘commission’
(69) bil ‘car’, fly ‘plane’, hus ‘house’

The nouns and sampled corpus for the animate and organization classes were chosen randomly. However, since many of the inanimate nouns are rare in the genitive, we chose nouns which denote internally structured entities, which in principle can express the part-whole relation claimed to be the prototypical relation for this class (Rosenbach 2003).93

We annotated the resulting corpus of genitive constructions according to the distinctions made in Rosenbach 2003, but collapsed the non-prototypical categories for the animate and inanimate classes, covering states, abstract possession and non-part-whole relations.94 We also distinguished nominalizations as a separate class.95 The classes we annotated the genitives for are presented in table 6.17.

Tag        Description                                  Genitive relation
ProtoAnim  Prototypical for animate                     body parts, kinship terms and permanent/legal ownership
ProtoInan  Prototypical for inanimate                   part-whole
Non-Proto  Non-prototypical for animate and inanimate   states, abstract possession, non-part-whole
Nom        Nominalizations                              subject of nominalized verb

Table 6.17: Overview of annotation classes in the corpus study of genitive organization nouns.

Nominalizations are closely linked to the argument structure of a verb (Grimshaw 1990), a dimension where animacy is clearly an important factor, see sections 3.5 and 3.6.2.96 We annotated as nominalizations all constructions where the head noun was clearly derived from a verb and where the genitive expressed the subject of such a verb. These may or may not express any additional arguments overtly, see (70) and (71) respectively:

(70) . . . bystyrets         vedtak    om     å   slippe  CO2-avgiftene
     . . . city-council-GEN  decision  about  to  pass    CO2-taxes-DEF
     ‘The city council’s decision to pass the CO2 taxes’

(71) . . . anvendt  i   forhold   til  bystyrets             prioriteringer
     . . . used     in  relation  to   city-council-DEF.GEN  priorities
     ‘. . . used in relation to the city-council’s priorities’

Nominalizations may also take the form of compounds, where the non-head expresses the object of the nominalized verb:97

(72) . . . bystyrets       boikottvedtak
     . . . city-council’s  boycott-decision
     ‘the city-council’s decision to boycott’

93 Not all of the inanimate nouns had as many as 50 genitive occurrences, which is why the results are given as relative frequencies rather than absolute ones.
94 Non-prototypical genitive relations for inanimate nouns are all relations which are non-part-whole, e.g. husets skjebne ‘the house’s destiny’.
95 Rosenbach (2003) excludes genitive constructions where the head is a nominalization from her study; however, we chose to examine these as well.
96 Grimshaw (1990) distinguishes between two types of nominalizations: result nominalizations and complex event nominalizations, where only the latter actually has an argument structure. The genitive in the former case is simply a modifier whereas in the latter case, it is a suppressed subject and the other arguments (if any) are obligatory. Many nominalizations are ambiguous between the two readings, and Grimshaw posits several tests which elucidate the difference. For instance, adverbials which relate to event structure, such as constant and frequent, force an event reading, and hence require the internal arguments to be expressed: (i) The constant assignment *(of unsolvable problems) is to be avoided.
97 We did not attempt any systematic disambiguation with respect to result vs. event nominalizations (Grimshaw 1990). This distinction seems difficult to apply in practice and it is not clear that the tests proposed hold also for Norwegian.

              ProtoAnim     ProtoInan     Non-Proto     Nom
Noun class    Mean   SD     Mean   SD     Mean   SD     Mean   SD
Animate       42.7   5.7    n/a    n/a    38.7   9.3    17.3   5.0
Organization  10.7   4.1    21.3   3.7    22.0   7.5    44.7   10.0
Inanimate     n/a    n/a    60.6   16.4   38.7   17.2   0.0    0.0

Table 6.18: Mean percentages and standard deviations for the noun classes in different genitive relations from the corpus study of 450 genitive constructions.

The results from the small corpus study are presented in table 6.18, where we present the aggregated means of the three nouns from each class for a total of 450 genitives. We find that the animate and inanimate nouns follow the patterns predicted in Rosenbach 2003, with a greater percentage of prototypical class usages than non-prototypical. However, for the animate class this difference is not significant and we see a large proportion of more abstract possessive relations, e.g. gutters liv ‘boys’ life’, kvinners status ‘women’s status’, presidentens moral ‘the president’s morale’. It is also apparent that the organization nouns in the study do in fact occur with all the possible possessive relations, i.e. the relations typically associated with both animate and inanimate nouns:

• Possession: (73) . . . bankens sedler og mynter ‘the bank’s bills and coins’
• Abstract/state: (74) . . . bedriftens lønnsomhet ‘the company’s profitability’
• Nominalization: (75) . . . bankens vedtak ‘the bank’s decision . . . ’
• Part-whole: (76) . . . kommisjonens leder ‘the commission’s leader’

It is interesting to note that it is not the prototypical animate relations which dominate the organization usage, but rather the nominalizations. This observation fits in nicely with the fact that organizations function so frequently as transitive, main clause subjects.
However, it is also clear that the prototypical inanimate part-whole relation accounts for a fair portion (21.3%) of the genitive usages of organizations.

Our initial hypothesis is thus supported; we find that organizations may occur with the possessive relations associated with both animate and inanimate nouns, and hence have a higher frequency of genitives. This confirms their intermediate status.

6.6.3.2 Subjecthood

Organizations occur significantly more often as transitive subjects than animate nouns, and subjecthood turned out to be a successful predictor in our three-way classification. This is not that surprising, as organizations are by definition decision-making entities, and hence would be expected to exhibit a high level of agentivity. If we examine the lexical verbs which occur most often in our material with organization subjects we find the following list, sorted by frequency:98

(77) ha ‘have’, gi ‘give’, være ‘be’, fastsette ‘determine’, foreslå ‘suggest’, mene ‘mean/think’, få ‘get’, legge ‘lay’, gjøre ‘do’, ta ‘take’, vurdere ‘assess’, se ‘see’, finne ‘find’, styre ‘govern’, bestemme ‘decide’, foreta ‘perform’, understreke ‘underline’, kreve ‘demand’, anta ‘assume’, bruke ‘use’, bli ‘become’, anbefale ‘recommend’, si ‘say’, betale ‘pay’, kunne ‘know’, ønske ‘wish’, stille ‘ask’, utarbeide ‘develop’, sette ‘set’

The majority of the verbs in (77) are clearly agentive verbs which denote events of decision-making and opinionating. Furthermore, we find that the verb ha ‘have’ is the most frequent verb for the organization nouns. In a study of Swedish, Dahl and Fraurud (1996) find that inanimate transitive subjects are to a large extent subjects of the verb ha ‘have’.
Some examples from our study of organization nouns are provided below:

(78) Banken    har  innskudd  i   Rogaland
     bank-DEF  has  deposits  in  Rogaland
     ‘The bank has deposits in Rogaland’

(79) Administrasjonen    har  ansvaret            for  . . .
     administration-DEF  has  responsibility-DEF  for  . . .
     ‘The administration is responsible for . . . ’

(80) Bedriften    har  for  svake  eiere
     company-DEF  has  too  weak   owners
     ‘The company has too weak owners’

(81) Foreningen       har  100  medlemmer
     association-DEF  has  100  members
     ‘The association has 100 members’

We see that the construction is clearly compatible with inanimate nouns, as it expresses a part-whole relationship as in (80)–(81), as well as possessive ownership as in (78) and abstract state in (79).99

The distribution of organizations as subjects is compatible with the findings of the corpus investigation for the genitive construction. Organizations are frequent in this construction as they may occur in both animate and inanimate guise. They may occur with clearly agentive verbs, as well as with the possessive verb ha ‘have’ to mark a part-whole reading.

98 These are lexical head verbs only, i.e. verbs that occur either as the single finite verb in a clause or as non-finite participle along with a finite auxiliary.
99 In the example in (79) ha ‘have’ is actually more like a light verb which together with its complement forms the complex predicate ‘be-responsible’.

6.6.3.3 Objecthood

Organization nouns occur infrequently on average as direct objects, in fact significantly less often than regular animate nouns (p<.0001). In order to get a better idea of the differences in direct object realization between the three classes, we performed a corpus study.
From each of the three classes, three randomly chosen nouns were selected and we extracted 80 random examples of object occurrences for each of these.100 The nine nouns are presented in (82)–(84):

(82) jente ‘girl’, lege ‘doctor’, president ‘president’
(83) departement ‘ministry’, kommisjon ‘commission’, regjering ‘government’
(84) hus ‘house’, opplysning ‘(piece of) information’, penge ‘coin’

The corpus studies of the distribution of organizations in genitive and subject constructions examined the more fine-grained semantic relationships expressed by these syntactic constructions, as well as differences in frequency distributions between the three classes. Objects are typically patients or themes, generally expressing an affected participant in the event denoted by the verb (Dowty 1991). When annotating the data set, we distinguish between regular direct objects and objects which have a dual status in that they are also logical subjects of a following active subordinate clause. We earlier assumed that the organizations occupied an intermediate position on the animacy hierarchy which gave them a flexibility with respect to the genitive construction. When functioning as objects, this duality or intermediacy may be reflected in the types of structures where an argument at one and the same time stands in a thematic relationship with two different verbs, i.e. functioning as subject and object at the same time. In the annotation we adopted the following definitions and annotated for two classes:

Regular object  the noun is a direct object of a transitive verb and is not a subject of a following subordinate clause
Dual object     the noun is a direct object but is also the logical subject of an ensuing subordinate clause

Note that we are not requiring the dual objects to be logical objects of the verb, only structural ones. This allows us to include so-called ECM or raising-to-object constructions, where the object is not a thematically entailed argument of the matrix verb, as well as cleft constructions where the structural object is the logical subject of both the matrix and subordinate clause. Examples of the types of constructions included in this annotation category are given in (85)–(89) below. (85) and (86) provide examples of a control and an ECM construction respectively, and in (87) we have an example of a cleft construction with a copula verb, an expletive subject and an object which is the subject participant of the obligatory relative clause. In (88) the object argument is modified by a subject-relative clause, whereas in (89) it forms a small clause construction.101

(85) De    anklagde  kommisjonen     for  å   utnytte  sin        posisjon
     they  accused   commission-DEF  for  to  abuse    their/its  position
     ‘They accused the commission of taking advantage of their/its position’

(86) De    fikk  kommisjonen     til  å   utnytte  sin        posisjon
     they  got   commission-DEF  to   to  abuse    their/its  position
     ‘They made the commission take advantage of their/its position’

(87) Det  er  kommisjonen     som   utnytter  sin        posisjon
     it   is  commission-DEF  that  abuses    their/its  position
     ‘It is the commission that takes advantage of their/its position’

(88) De    tilhørte  kommisjonen     som   utnyttet   sin        posisjon
     they  belonged  commission-DEF  that  exploited  their/its  position
     ‘They belonged to the commission that took advantage of their/its position’

(89) De    anså        kommisjonen     som  suspekt
     they  considered  commission-DEF  as   suspicious
     ‘They considered the commission to be suspicious’

100 We chose to extract 80 examples as there turned out to be quite a bit of noise in the tagger’s analysis of objects and we wanted to ensure a reasonable amount of data. Simple errors in the automatic analysis were manually filtered out from the annotation, leaving only the real direct objects for analysis. This amounted to a total of 406 object instances.
101 The examples in (85)–(89) are constructed.

              Regular object   Dual object
Noun class    Mean   SD        Mean   SD
Animate       79.4   6.4       20.6   6.4
Organization  53.3   6.7       46.7   6.7
Inanimate     98.3   1.5       1.7    1.5

Table 6.19: Mean percentages and standard deviations for the noun classes in different object relations from the corpus study of 406 objects.

Table 6.19 shows the mean frequencies, along with standard deviations, of regular direct objects and objects which are logical subjects of following subordinate clauses in the data collected for the nouns from each class. The organization nouns are clearly more frequent in the dual object position. This relates to the fact that they occur so frequently as subjects, and most often as agents, as we saw in the above section. The subordinate verbs in the dual object constructions are overwhelmingly agentive:

(90) Det  er  dette  departementet   som   skal   ta    avgjørelsen
     it   is  this   department-DEF  that  shall  take  decision-DEF
     ‘It is this department that will make the decision’

(91) Kina   hadde  tidligere  bedt   den  danske  regjeringen     om å  avlyse  besøket
     China  had    earlier    asked  the  Danish  government-DEF  to    cancel  visit-DEF
     ‘China had earlier asked the Danish government to cancel the visit’

In comparison, the animate nouns clearly more frequently occur as regular transitive objects, with almost 80% of the animate object occurrences in the corpus data.
The inanimate nouns hardly have any dual object constructions at all (3 instances, 1.7%) and the ones we find consist of a predicative subordinate clause, as in (92), or are examples of metaphorical extensions through anthropomorphization, as in (93):

(92) De    har   plikt  til å  gi    alle  opplysninger  som   er  nødvendig
     they  have  duty   to     give  all   information   that  is  necessary
     ‘They are obliged to provide all information necessary’

(93) Men  det    finnes  et   og   annet    hus    som   eier  sin  eier
     but  there  exists  one  and  another  house  that  owns  its  owner
     ‘But there are houses that own their owners’

In section 3.6.2 above we discussed the notion of individuation, which is often mentioned as one which either intersects with animacy (Dahl and Fraurud 1996; Yamamoto 1999) or which subsumes animacy along with other properties such as definiteness and referentiality. The level of individuation relates to the degree to which we view an entity in the discourse as being a “clearly delimited and identifiable individual” (Dahl and Fraurud 1996). Animates will tend to be high in individuation, hence making them well-suited as objects (Hopper and Thompson 1980).102 Organizations are referentially mass-like and abstract. They do not point out a clearly delimited individual; however, they are often in definite form, e.g., regjeringen, byretten, kommisjonen ‘the government, city-court, commission’, because there is only one of them with respect to a particular time and place.

One point, which we have mentioned only in passing, but which relates to the dimension of individuation, is pronominal reference. Animate nouns have a strong tendency for reference by a personal pronoun, whereas this is lower for inanimates (Dahl and Fraurud 1996). Our data on the pronominal features (ANAAN, ANAIN) have measured the pronominal reference by personal pronouns, i.e., the pronouns which clearly show the animacy of their referent.
An interesting property of the organizations is that they, like the inanimate nouns, are very infrequently referred to by means of an animate personal pronoun. That is, even though the type of sequence in (94) is perfectly grammatical, it is hardly encountered:103

(94) Komiteen_i     ankom    i     formiddag.  De_i  ville   . . .
     committee-DEF  arrived  this  morning.    They  wanted  . . .
     ‘The committee arrived this morning. They wanted . . . ’

In this respect, the organization nouns differ distinctly from the animate nouns (p<0.0005) and behave more like the inanimate nouns. This indicates that they are not individuated enough to merit pronominal reference.

102 Note that the predictions of Hopper and Thompson (1980) with respect to direct objects are contrary to those of prominence-conserving theories of argument realization, such as Aissen 2003. On the latter view, animate entities are marked objects precisely because they are prominent.
103 The example in (94) is constructed.

6.6.4 Conclusion

The results from a set of classification experiments indicated that organizations behave linguistically in a manner which clearly sets them apart from regular animate or inanimate common nouns, a behaviour which can be exploited in automatic classification. With reference to the morphosyntactic SUBJ, OBJ and GEN features, we investigated the nature of this difference in terms of more fine-grained syntactic and semantic distinctions. In particular we found empirical evidence for the intermediacy of organizations, where these nouns may take on both animate and inanimate readings. There is also a clear agentive pattern which emerges in the distribution of these nouns; they occur frequently as transitive subjects, have a large proportion of nominalizations in the genitive and, even as objects, often dually hold an agentive role in a following subordinate clause.
It is in the very semantic nature of organizations that they are action-taking and decision-making entities, so it should not be surprising that this shows up in their linguistic behaviour. What makes them interesting from the point of view of animacy is that they exhibit a duality compatible with both extremes of the animacy dimension: highly individuated and agentive, or less individuated, mass-like and internally structured.

6.7 Unsupervised learning as class exploration

We have seen in the above sections that the chosen set of linguistically motivated distributional features approximates the property of animacy well. Section 6.6 investigated how the addition of organization nouns to the data set influenced the classification results and also investigated empirically the proposal that these nouns constitute an intermediate animacy category, both through machine learning and corpus studies. In the following we will apply the unsupervised machine learning technique of clustering, see section 5.2.3. We employ clustering primarily as a technique for data exploration (Boleda, Badia and Batlle 2004; Boleda 2007) and thereby provide an additional perspective on the task of animacy classification. The main goal is to examine the categories which, under a distributional view of the nouns and based on our selected features, emerge when we do not classify according to a predefined set of classes. We also assess whether the unsupervised categories correspond in any systematic manner with the categories employed in our earlier, supervised experiments.

6.7.1 Experiment 8: Clustering

Experiment 8 investigates how the individual nouns cluster based on their distributional properties and examines the two levels of granularity discussed earlier. The advantage of employing an unsupervised technique is that there is no bias towards a predefined set of classes, but rather a direct focus on the properties of the nouns.
6.7.1.1 Experimental methodology

We employed the same data sets as in Experiments 1 and 7, i.e. the set of high-frequency nouns with binary (animate, inanimate) classification, as well as the data set with three-way (animate, inanimate and organization) classification. For clustering we employed the Cluto software with default settings, see section 5.2.3.104 Clustering is partitional, whereby a clustering solution is obtained by partitioning the data set into an increasingly large set of clusters until the desired number of clusters k is obtained. At each partitioning of the vector space, a criterion function is optimized. We employ an internal criterion function which maximizes the intra-cluster similarity of each cluster, where similarity is computed with the cosine function.105 The parameter which is varied in the experiments is k, which specifies the desired number of clusters.

6.7.1.2 Overview and evaluation of cluster solutions

When a clustering solution has been obtained for a data set, it must also be presented in a manner which provides an overview of the content of each cluster. There are several different cluster properties which in various ways provide a summary of a cluster solution. We focus on the following:106

Internal quality: The tightness of a cluster can be obtained by looking at the average internal similarities of the nouns contained in the cluster, and its overlap by looking at the average similarity of the elements in the cluster with the rest of the elements in the data set. These are internal quality measures, as they do not make use of class information which is external to the clustering solution.
Features: A cluster may also be summarized by the features which were most important in obtaining the particular cluster solution: the descriptors are the features which contribute the most to the similarities of the instances in the cluster and the discriminators are features which contribute the most in distinguishing the cluster elements from the total set of instances.

104 The default settings in Cluto are: clustering by repeated bisections (rb) with the criterion function I2.
105 To be precise, the criterion function maximizes the similarity between each member of a cluster and the centroid vector of the cluster, which is obtained by averaging over the vectors in the cluster.
106 The tightness and overlap of a cluster solution correspond to the ISim and ESim measures of Karypis (2002).

The above measures do not take into account any predefined classification of the instances. However, since our data set does contain classified instances, we may also take these into account when evaluating the cluster solution. We thus examine information that was not employed during clustering, and hence is external to the clustering as such. The purity of a clustering solution measures the degree to which a proposed cluster contains instances of the same class, and gives the proportion of cluster elements which are of the majority class. For a given cluster $S_r$ of size $n_r$, we have that (Zhao and Karypis 2003):

\[
\mathrm{Purity}(S_r) = \frac{1}{n_r} \max_i \left( n_r^i \right)
\]

where $n_r^i$ is the number of instances of the $i$th class that were assigned to the $r$th cluster. The purity of the entire cluster solution is computed as the weighted sum of the purities of the individual clusters.

Cluster  Size  Tightness  Overlap  Anim  Inan
1        1     1.0        0.5      0     1
2        21    0.97       0.77     2     19
3        18    0.97       0.77     18    0

Table 6.20: Cluster solution with best internal quality for the high-frequency animate-inanimate data; ordered by decreasing tightness − overlap.
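The purity measure can be retraced directly from the class counts in table 6.20. The following is a minimal sketch, assuming that the weights in the "weighted sum" are the relative cluster sizes n_r / n, which the text above does not spell out; the function names are hypothetical.

```python
def cluster_purity(class_counts):
    """Purity of one cluster: share of its members in the majority class."""
    return max(class_counts) / sum(class_counts)


def solution_purity(clusters):
    """Purity of a whole solution: cluster purities weighted by relative
    cluster size (assumed weighting, n_r / n)."""
    total = sum(sum(counts) for counts in clusters)
    return sum(
        (sum(counts) / total) * cluster_purity(counts) for counts in clusters
    )


# Class counts per cluster from table 6.20, as (Anim, Inan).
TABLE_6_20 = [(0, 1), (2, 19), (18, 0)]

for number, counts in enumerate(TABLE_6_20, start=1):
    print(f"cluster {number}: purity {cluster_purity(counts):.2f}")
print(f"solution purity: {solution_purity(TABLE_6_20):.2f}")
```

With size-proportional weights the solution purity reduces to the number of majority-class members over all clusters divided by the number of nouns, here (1 + 19 + 18) / 40.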
6.7.1.3 Anim-Inan data A clustering experiment was performed on the data set consisting only of high- frequency animate and inanimate nouns, where the number of clusters was varied: k = {2,3,4,5,6}. A 3-way clustering solution obtained the best inter- nal quality, i.e. in terms of average tightness and overlap of the clusters. Table 6.20 shows the clustering solution, where each row represents a cluster. The purity of the solution is 0.95, which is high. We find that the cluster solution with the best internal quality has two clusters which roughly correspond to our classes of animate (cluster 3) and inanimate (cluster 2), and an additional cluster which consists of only one element, namely the noun dag ‘day’ (clus- ter 1). We find that the genitive feature was the descriptive and discriminating feature for this cluster. Temporal expressions are often mentioned in work on genitive constructions (Rosenbach 2002), because they behave unlike other inanimate nouns in this respect. The noun dag ‘day’ has an unusual high pro- portion of genitive occurrences (0.15) compared with the average inanimate noun (0.02).107 Cluster 2 consists primarily of inanimate nouns, as well as the 107The noun dag ‘day’ has a high proportion of genitive instances also compared to another temporal noun in the data set, namely night ‘natt’. In the data for these two we find the expected 6.7 Unsupervised learning as class exploration 119 Size Tightness Overlap Anim Inan 1 62 0.92 0.62 56 6 2 58 0.91 0.62 4 54 Table 6.21: 2-way cluster solution for the high-frequency, ∼100 and ∼50 animate- inanimate data; ordered by decreasing tightness−overlap. two animate nouns barn ‘child’ and venn ‘friend’. We noted already that the useful features of subject and objecthood do not give sufficient distributional evidence for nouns of this type. Children and friends are typically entities that we possess, and are not that frequently agentive. 
We find that for this cluster, the highest ranked descriptive features are the object, subject and genitive features.

In order to evaluate the effect of sparse data on the clustering, an identical experiment was performed on the entire data set of animate and inanimate nouns, i.e. the concatenation of the data on high-frequency nouns and on nouns with absolute frequencies of ∼100 and ∼50. The cluster solution is presented in table 6.21 and we find that the results largely corroborate the ones obtained in the supervised classification experiments: feature back-off enables a good distinction between the two classes. A two-way clustering yields a total purity of 0.917, where the SUBJ and OBJ features are the primary features employed.

6.7.1.4 Anim-Org-Inan data

A similar clustering experiment with k = {2,3,4,5,6} was performed on the data set containing nouns from the three classes of animate, organization and inanimate. Whereas the optimal clustering solution according to the internal quality metrics for the binary classification also obtained the highest external quality, i.e. purity, this is not the case for the ternary (Anim-Org-Inan) data. The clustering solution with k = 2 in fact achieves the best internal quality, but the lowest purity. Table 6.22 shows the clustering solution, where each row once again represents a cluster. The purity of the solution is 0.65.

The cluster solution once again supports the main distinction between animate and inanimate nouns, but also gives an indication of gradience. Cluster

construction where the genitive expresses the temporal situation of the head noun, e.g. dagens møte ‘the day’s meeting’, nattens match ‘the night’s match’. However, dag ‘day’ also occurs in a more general, fixed expression meaning something like ‘the current state of’, as in dagens samfunn/skole/generasjon ‘the current state of society/schools/generation’.
This usage is dominant in the data for this noun and helps to explain the difference in distribution between these two, otherwise semantically similar nouns.

Cluster  Size  Tightness  Overlap  Anim  Org  Inan
1          23       0.94     0.66     4    0    19
2          37       0.91     0.66    16   20     1

Table 6.22: 2-way cluster solution with best internal quality for the anim-org-inan data; ordered by decreasing tightness−overlap.

Cluster  Size  Tightness  Overlap  Anim  Org  Inan
1          23       0.94     0.66     4    0    19
2          18       0.93     0.69     0   17     1
3          19       0.97     0.82    16    3     0

Table 6.23: 3-way cluster solution for the anim-org-inan data; ordered by decreasing tightness−overlap.

1 contains all the inanimate nouns excluding the aforementioned dag ‘day’. In addition, four animate nouns are clustered with the inanimate nouns. The feature descriptors for this cluster list the OBJ feature as being of particular importance, and, indeed, the cluster contains the aforementioned barn ‘child’ and venn ‘friend’. We find that all the organization nouns and a majority of the animate nouns have been assigned to one cluster (cluster 2). The most important features for the creation of the clusters were primarily the subject and object features. It is clear that the linguistic behaviour of the organization nouns captured in these features is more distinct from that of the inanimate nouns than is the behaviour of the animate nouns themselves. It is not surprising then, that the organization nouns form the basis for a cluster along with a majority of the animate nouns.

The cluster solution for k = 3 is presented in table 6.23. This solution has a slightly lower internal quality, but better purity (0.87). We find that cluster 1 is identical to cluster 1 in the k = 2 experiment presented in table 6.22. The two additional clusters have been created by splitting the initial cluster 2 into two separate clusters (clusters 2 and 3 in table 6.23).
We find that the classes of organization and animate correspond fairly well with this partitioning of the instances; cluster 2 consists primarily of organization nouns, whereas cluster 3 consists in majority of animate nouns. Even so, the internal quality measure of overlap indicates that these two clusters are very similar, hence the internal quality of the cluster solution deteriorates.

We stated initially that the main goal of this section was to employ unsupervised machine learning for data exploration. The results have clearly indicated the existence of a distinction between animate and inanimate nouns in the data itself. This shows that the two classes are natural and not simply superimposed on the data. Just like the supervised experiments and the corpus studies, however, the clustering experiments emphasize the gradience of the animacy dimension. The granularity of the animacy dimension has been addressed by looking at a set of organization nouns. The clustering experiments pick up on the same trend as shown in the preceding sections; the organization nouns are more similar to the animate nouns than the inanimate nouns but constitute a group in the sense that they have a common set of distributional properties. As the corpus studies in 6.6.3 showed, however, our features are morphosyntactic approximators of more fine-grained syntactic and semantic distinctions, where the organization nouns further confirm their intermediate status.

6.8 Summary of main results

At the beginning of this chapter we formulated a set of research questions to be addressed. The general viability of a method for animacy classification has been addressed throughout the work described above. We have seen that animacy may be acquired through a set of morphosyntactic features capturing the morphosyntactic distribution of nouns and emphasizing the clear correlation between syntax and semantics with regard to animacy.
We formulated a set of features which approximate linguistic correlations between animacy and distinctions in argumenthood, agentivity and individuation. We also tested the importance of the various features in classification and obtained results that show the importance of animacy in argument differentiation. We found that the SUBJ and OBJ features were central predictors of animacy throughout the above sections.

With respect to the generalizability of the method, we examined the robustness of classification in the face of sparse data. It is not surprising that sparse data negatively affects a method which relies on distributional features, and this was established for our method as well. However, we found that the classification accuracy obtained for high-frequency nouns (with absolute frequencies >1000) can be maintained for nouns with considerably lower frequencies (∼50) by backing off to a smaller set of features at classification. We also examined the generalizability of the method across machine learning algorithms. The switch from decision trees to memory-based learning gave slight improvements and highlighted general differences between eager and lazy learners. We also looked at unsupervised learning and found that the same class distinctions were made without a set of supervised training examples.

Finally, the issue of gradience in the animacy dimension was approached through experiments with class granularity, and a more fine-grained, three-way classification was tested empirically. The results underline the main opposition between animate and inanimate, but also show that finer distinctions may be approached empirically through data-driven animacy classification.

The experiments described above were geared primarily towards the use of machine learning to evaluate the theoretical proposals regarding animacy and argumenthood in particular.
In order to make a detailed study feasible, a small set of manually selected nouns was employed. A uniform distribution of classes was maintained in the data in order to ensure sufficient data for classification. However, both of these assumptions present clear idealizations. A natural next step would be to test the method developed in the current chapter further, by applying it to a larger set of nouns and evaluating the extent to which it is scalable. This is the topic of chapter 7.

7 ACQUIRING ANIMACY – SCALING UP

The experiments reported in chapter 6 allowed us to explore several topics related to animacy classification, such as feature selection, data sparseness and class granularity, with a manually selected set of nouns. This chapter reports on experiments dealing with the scaling up of animacy acquisition and, in doing so, assesses the generalizability of the methods described in the previous chapter. In order to apply the supervised learning methods tested in the previous chapter, we need a set of nouns annotated for animacy. In section 7.1, we will examine and assess annotation schemes for animacy and discuss the annotation for person reference found in a Swedish treebank. Section 7.2 presents the resulting data set and discusses data representation in terms of features. In section 7.3, we will describe a set of classification experiments on the resulting data set. We address the following questions:

Animacy annotation: Which criteria may be employed for annotation of animacy? Which properties should animacy annotation have in order to support lexical acquisition?

Transfer of method: Will the general classification method and features transfer to another data set and to a different, although closely related, language?

Robustness revisited: To what extent is classification robust to data sparseness?

Class distribution: How will a non-uniform distribution affect the results?
Feature importance: Which features are important in the scaling up of animacy classification?

Machine learning algorithm revisited: Do we observe any significant differences between eager and lazy machine learning algorithms in animacy classification?

Class granularity: Can we find evidence for gradience of animacy in human annotation for this property? Can we find evidence for gradience of animacy in the experimental results?

7.1 Obtaining animacy data

This section examines methods for obtaining data on animacy. We will start out by discussing criteria for animacy annotation with basis in previous annotation schemes and move on to present the manually annotated data in the Swedish treebank Talbanken05.[108]

[108] All examples in this section are taken from the Talbanken05 treebank.

7.1.1 Animacy annotation

Annotation for animacy is not a common component of corpora or treebanks. However, following from the theoretical interest in the property of animacy, as discussed in chapter 3, there have been some initiatives directed at animacy annotation of corpus data. In the following, we present an annotation scheme developed for English and a small annotation study aimed at testing the scheme for Swedish.

7.1.1.1 Annotation schemes

Corpus studies of animacy (Yamamoto 1999; Dahl and Fraurud 1996) have made use of annotated data; however, they differ in the extent to which the annotation has been explicitly formulated as an annotation scheme. The annotation study presented in Zaenen et al. 2004 makes use of a coding manual designed for a project studying genitive modification (Garretson et al. 2004) and presents an annotation scheme for animacy, illustrated by figure 7.[109] The main class distinction for animacy is three-way, with subclasses under two of the main classes:

• Human (HUM)
• Other animate: Organizations (ORG), Non-Human Animates or Animals (ANIM)
• Inanimate: Concrete (CONC), Non-Concrete (NCONC), Time (TIME), Place (PLACE)

Figure 7: Animacy classification scheme (HUM; Other animate: ORG, ANIM; Inanimate: CONC, NCONC, TIME, PLACE).

[109] The fact that the study focuses on genitival modification has clearly influenced the categories distinguished, as these are all distinctions which have been claimed to influence the choice of genitive construction. For instance, as mentioned earlier in section 6.7.1, temporal nouns are frequent in genitive constructions, unlike the other inanimate nouns.

The ‘Other animate’ class further distinguishes Organizations and Animals. Within the group of inanimates, a distinction is made between concrete and non-concrete inanimate. The concrete class is employed when the markable refers to “‘prototypical’ concrete objects or substances. Excluded are things like air, voice, wind and other intangibles. Body parts are concrete” (Zaenen et al. 2004: 4). The non-concrete class is the default class, and is employed for markables that refer to entities that are not prototypically concrete but clearly inanimate. This includes events, abstract concepts or generalizations. Place and time expressions are also distinguished within the main category of inanimate.

7.1.1.2 Annotation study

A small annotation study for Swedish was performed on the Talbanken05 material in order to test the scheme proposed in Garretson et al. 2004 and get an overview of the distribution of the different classes. In order to do so, we annotated a semi-random sample from the prose section of Talbanken05 consisting of 108 sentences and 383 markables.[110] The markables in the study include all common nouns in the sample, with a few exceptions.[111]

The resulting distribution of annotated markables over the classes is presented in table 7.1. The ‘non-concrete’ category is in clear majority (61.4%), followed by the ‘Human’ class (16.2%).
Due to this, the main category ‘Inanimate’ is also in clear majority, accounting for 78.6% of the markables. We find that the intermediate category ‘other animate’ is quite infrequent, accounting for only 5.2% of the nouns in the sample, with animals at 1.8% and organizations at 3.4%.

Class            #      %     Sub-class    #      %
HUM             62   16.2     HUM         62   16.2
Other animate   20    5.2     ORG         13    3.4
                              ANIM         7    1.8
Inanimate      301   78.6     CONC        40   10.4
                              NCONC      235   61.4
                              TIME        13    3.4
                              PLACE       13    3.4
Tot            383  100.0                383  100.0

Table 7.1: The distribution of markables over the different classes and sub-classes in the annotation study.

[110] To be precise, every 60th sentence was extracted from the prose section of Talbanken05 and annotated.
[111] All common nouns were annotated with the following exceptions: first conjuncts in abbreviated compound coordination constructions of the type ‘N- och NN’, e.g. familje- och statsbudgeten ‘family- and state budget’, parts of functional multiword units, e.g. på grund av ‘for reasons of’, and quantifying nouns, e.g. en rad_DET förmåner ‘a row (of) benefits’.

7.1.1.3 Reference as annotation criterion

In the pilot annotation study we followed the annotation scheme described in Garretson et al. (2004), which annotates the markables according to the animacy of their referent in the particular context. However, using reference as a criterion can be problematic. First of all, by doing so one implicitly assumes that all markables refer and hence have a determinable referent. Secondly, by taking a context-dependent view of animacy, there is a danger that the resulting annotation does not deal with animacy at all, but rather with a context-dependent notion of individuation or accessibility. We will examine these issues in turn below.

Garretson et al. (2004) state that “when coding for animacy [. . . ] we are not considering the nominal per se (e.g., the word ‘church’), but rather the entity that is the referent of that nominal (e.g.
some particular thing in the real world)”. This indicates that for all possible markables, a referent should be determinable. In the annotation of the Swedish sample, however, it became clear that this assumption is problematic. In (95) below, we find an example of person-denoting nouns with generic readings.

(95) Hyressättningen   grundas  på  avtal      mellan   hyresvärd  och  hyresgäst
     rent-setting-DEF  built    on  agreement  between  landlord   and  tenant
     ‘The rent is based on an agreement between landlord and tenant’

The referent of a generic reading differs from a specific one in being a ‘reference to kinds’ (Carlson 1980). In a very narrow interpretation of reference, one may want to exclude generic readings completely. However, it is not the case for these that the animacy of the markables may not be determined.

Another problematic area with regard to reference deals with noun phrases which incur a predicational reading, e.g. (96)–(97) below:

(96) Det  är  en  uppöver  öronen  förälskad  flicka
     it   is  a   up-over  ears    in-love    girl
     ‘That is an utterly infatuated girl’

(97) Han  är  representant    för  Svenska  Kyrkan
     he   is  representative  for  Swedish  Church-DEF
     ‘He is a representative for the Swedish Church’

Both of the examples in (96)–(97) are descriptive predicatives, which serve to classify or characterize the predicated argument further. These types of predicatives may be employed clearly referentially in Swedish with an indefinite article and often with a deictic argument (Teleman, Hellberg and Andersson 1999), as in (96). It is clear that the lack of indefinite article “dereferences” the predicative:

(98) Det   där    är  *flicka/*representant.
     that  there  is  girl/representative

For these classificational predicatives, the referent is rather a generic role, as in (97) above. One might claim that rather than being referential, these express a predication which concerns the subject and hence are propositional in nature.
The above discussion of generics and predicatives illustrates that relying on reference as a criterion for annotation can be problematic. This brings us to our second problem with the annotation principle of reference. If one assumes that a referent may be determined for all markables, there is a risk that the notion of animacy becomes diluted. In particular, such an approach confuses with animacy a range of related factors such as definiteness or individuation. Information to this end is present in the choice of NP-type, the formal definiteness of the NP, its abstractness and accessibility in the discourse. Additional annotations expressing these types of information are possibilities which might be explored, but should possibly be kept separate from the animacy dimension.

The above discussion ties in with the proposal in chapter 3 that animacy is largely a denotational property of nouns. Whereas reference may vary with the linguistic context, denotational properties are stable across contexts. In chapter 6, this assumption led us to the hypothesis that aggregated frequency data, i.e. data collected at the level of lemmas, could be exploited in animacy classification.

7.1.2 Person reference in Talbanken05

The Swedish treebank Talbanken05, see section 5.1.1, expresses a distinction for nominal elements between reference to person and non-person in the lexical layer of its annotation. The annotation manual (Teleman 1974) states that a markable should be tagged as person if it may be replaced by the interrogative pronoun vem ‘who’ and be referred to by the personal pronouns han ‘he’ or hon ‘she’. This goes for singular markables, whereas for their plural counterparts, the instruction is to annotate them as one would their singular forms. The following describes the annotation in a bit more detail.
7.1.2.1 Annotation scheme

As mentioned earlier, the annotation in the original Talbanken (the MAMBA scheme) consists of a column-based markup, where two main layers may be distinguished: a lexical and a syntactic one (Teleman 1974). The annotation for the distinction between person and non-person reference is found in the lexical layer, along with information about part-of-speech and varying other types of semantic information, depending on the part-of-speech in question, see section 5.1.1.

The person/non-person distinction is marked for the following parts-of-speech:

• Nouns: common (NN), proper (PN), meta nouns (MN), adjectival (AN) and verbal (VN)
• Pronouns (PO)
• Adjectives (AJ)
• Participles: present and perfect (SP/TP)
• Others: indefinite article (EN), numerals (RO)

The analysis found in the lexical layer ideally represents the type of information that is inherent for the word in question and hence non-contextual (Teleman 1974). For instance, the part-of-speech category of pronouns (PO) does not distinguish determiners from nominal heads in the lexical layer.[112] With respect to the annotation for person reference, it is clear that the syntactic environment has been taken into account. Nouns are marked for personhood regardless of syntactic context, whereas pronouns, adjectives, participles, indefinite articles and numerals are marked for personhood only when they are heads in nominal phrases (Teleman 1974). (99)–(103) exemplify nominal elements with person annotation (HH) of various parts of speech: pronoun in (99), adjective in (100), participle in (101), indefinite article in (102) and numeral in (103):

(99) De    som  tagits  ut   till  underofficersutbildning  . . .
     they  who  taken   out  to    under-officer-education  . . .
     ‘Those who have been chosen for the subordinate officer education . . . ’

(100) . . . att  flytta  den  unge   från  hemmet
      . . . to   move    the  young  from  home-DEF
      ‘. . .
to move the young one from his/her home’

(101) Antalet  skadade  var  140 000
      number   injured  was  140 000
      ‘. . . The number of injured was 140 000’

(102) En   ropar  det  rytmiska    ga-ga
      one  calls  the  rhythmical  ga-ga
      ‘One calls out the rhythmical ga-ga’

(103) År    1970  hade  ungefär        700 000  förvärvsarbete
      year  1970  had   approximately  700 000  gainful-employment
      ‘In the year 1970 approximately 700 000 had gainful employment’

Even though the annotation manual clearly states that a pronoun, adjective or participle should be annotated for personhood only when functioning as a nominal head, we find examples where adjectives and participles functioning as genitive modifiers are annotated as persons:

(104) . . . de   försäkrades  egna  sjukförsäkringsavgifter . . .
      . . . the  insured-GEN  own   health-insurance-fees . . .
      ‘. . . the health insurance fees of the insured’

Since genitives have a reference which is independent of the nominal which they modify, this decision seems reasonable.

[112] A pronoun like de ‘the-PL/they’, for instance, is annotated with the part-of-speech PO regardless of whether it bears a nominal syntactic function, e.g. subject/object, or functions as a determiner, e.g. de inkomster ‘the incomes’. These are distinguished only in the syntactic annotation in terms of dependency relation. See section 4.1 for more on pronouns in Scandinavian.

As mentioned earlier, all pronominal heads are annotated for personhood. The relative pronoun or marker som ‘who’ is analysed as a core argument in the relative clause (either subject or object) in Talbanken05 and always “inherits” the animacy from the head argument which it modifies.

The manual treats collective nouns as non-persons, including examples like personalen ‘staff-DEF’, polisen ‘police-DEF’, domarkåren ‘judge corps’, folket ‘people-DEF’ (Teleman 1974).
Animals are in general not treated as person referring, except in contexts where they are “anthropomorphised” and may be referred to by the pronouns han, hon ‘he, she’ (Teleman 1974).

7.1.2.2 Person reference and animacy

In section 7.1.1 above we discussed the annotation scheme employed in Zaenen et al. 2004. There are clear similarities between the annotation for person reference found in Talbanken05 and the annotation for animacy. Regardless of annotation scheme, the person/non-person distinction can be viewed as forming the outer perimeters of the animacy dimension and, in this respect, the annotation schemes do not conflict. Following the above overview of the annotation found in Talbanken05, we may compare it with annotation schemes for animacy, in particular the ones found in Garretson et al. 2004 (and hence Zaenen et al. 2004), as well as Yamamoto 1999. We find that the schemes differ primarily in the granularity of classes distinguished and the types of markables which are annotated:

• Classes: There is a partial overlap in classes between the person reference annotation in Talbanken05 and the approaches that explicitly annotate for the property of animacy. Garretson et al. 2004 contains the category Human, as well as Inanimate (at the top level of annotation), which must be assumed to correspond to the person/non-person distinction. The main source of variation in class distinctions consists in the annotation of collective nouns, including organizations, as well as animals. Animals and organizations are treated as inanimate in the Talbanken05 scheme, whereas they form an intermediate category in Garretson et al. 2004. The Talbanken05 scheme is similar to Yamamoto 1999 in treating organizations as inanimate, but differs in not providing a more detailed treatment for animals.
• Markables: Talbanken05 annotates slightly more markables than Yamamoto (1999) in also annotating adjectives and participles as nominal heads. Like Yamamoto, Talbanken05 includes relative, interrogative and indefinite pronouns. Garretson et al. 2004 is not comparable in this respect as it only annotates genitive constructions, and Zaenen et al. (2004) do not state explicitly what the exact markables of their study are.

We may conclude that the person/non-person distinction in Talbanken05 provides a valuable source of data on animacy. First of all, it makes the main distinction which is common to all approaches to animacy and animacy annotation: the distinction between human and inanimate. As the annotation study in 7.1.1 showed, organizations and animals are infrequent classes, hence we may assume that these do not disrupt generalizations regarding the class of inanimates in any significant way. Second, the annotation in Talbanken05 provides information regarding a wide range of markables including common and proper nouns, as well as pronouns.

7.1.2.3 The distribution of person reference in Talbanken05

In chapter 6 we examined distinctions in the distribution of animacy with respect to a set of theoretically motivated morphosyntactic features. In this section we approach this matter empirically and examine the general distribution of person versus non-person referring nominals in Talbanken05. We focus largely on syntactic distribution and examine distinctions within the groups of argument, as well as non-argument, relations.

A note on counts

As explained above, for several parts-of-speech personhood is expressed only when these function as nominal heads. When comparing the distributions of persons and non-persons for parts-of-speech other than nouns, our population should hence only consist of nominal heads.
However, ascertaining when a pronoun or an adjective is head of a nominal phrase is not completely straightforward in a dependency annotation where there is no direct concept of phrases. In the following section, we approximate the notion of nominal head to head of a nominal dependency relation and define these to be the argument functions defined in section 5.1.1.[113] As mentioned earlier, person reference is also relevant for genitival modifiers since these have an independent reference, hence we include these also in our overview.

[113] This is admittedly a simplification; nominal elements may certainly also have other functions. However, it is fair to assume that the argument functions are the functions which are predominantly nominal.

Overview

Table 7.2 shows the absolute number of tokens annotated as person in the written sections of Talbanken05, broken down by part-of-speech. We find that person reference is most common for pronouns and nouns, which together account for 97.8% of the total person instances.

Part-of-speech        Person #
N   noun                  7066
PO  pronoun               9809
AJ  adjective              280
P   participle              57
R   numeral                 33
EN  indef. pronoun          12
Total                    17257

Table 7.2: Total number of tokens annotated as persons in the written sections of Talbanken05, broken down by part-of-speech.

Table 7.3 shows the distribution of person/non-person over nominal heads, also broken down by part-of-speech. In general, without discerning individual NP-types, we see that non-persons are more frequent in the corpus than persons. It is also clear, however, that the personhood or animacy dimension influences referentiality and more specifically, the part-of-speech employed. As mentioned in section 3.4, persons are often referred to by a pronoun and we find that the percentage of persons is high for pronominal arguments (51.8%).
Table 7.4 presents the distribution of person and non-person nouns and pronouns across various dependency relations in Talbanken, see table 5.3 in section 5.1.1.[114][115] There are some clear tendencies towards differences in distribution between the two categories (person/non-person) and we can ascertain that person and non-person referring nouns and pronouns differ significantly in their general syntactic distribution (p < .0001, df = 19).[116]

[114] By limiting the overview to nouns and pronouns we ensure a comparison of nominal functions where person reference is possible. For instance, adjectival predicatives are not nominal and referential. Also, clausal complements are annotated as objects, but are not nominal and referential and should not be employed to compare the distribution of person vs. non-person in direct objects.
[115] Table 7.4 includes dependency relations which have more than 10 person instances.
[116] Pearson’s Chi-Squared test with Yates’ continuity correction with 19 degrees of freedom over a 2x20 matrix with rows=dependency relations and columns=person,non-person.

             Person         Non-Person      Total
             #       %      #       %       #       %
Noun         5187   15.4    28421   84.6    33608   100.0
Pronoun      7596   51.8     7067   48.2    14663   100.0
Adj           206    6.7     2871   93.3     3077   100.0
Part           39    5.5      665   94.5      704   100.0
Num            30    7.4      375   92.6      405   100.0
Indef pro      10   16.7       50   83.3       60   100.0

Table 7.3: Absolute (#) and relative frequencies (%) of person/non-person nominal heads in the written sections of Talbanken05, broken down by part-of-speech.

In section 3.1 we established a set of distinctions within the group of argument relations and argued that animacy is a dimension by which arguments are differentiated. We may now test empirically whether different types of arguments differ with respect to person reference.
The argument relations for which we find person referring elements are the subject (SS), indirect and direct object (IO, OO), subject and object predicative (SP, OP), as well as the logical subject (ES) relations.[117] We find that indirect objects and subjects exhibit the highest percentages of person referring nominals: 87.3% and 44.8%, respectively.[118] The percentage of person referring direct objects is clearly lower (21.2%). We have noted in several places that subjects and direct objects tend to differ with respect to animacy. Dahl and Fraurud (1996) show that person NPs are more likely to occur as subjects of transitive clauses than non-persons. The counts for subjects in our case contain all subjects, not only subjects of transitive verbs; however, we clearly see the same trend. In the Talbanken05 data, we find that the person reference of subjects and direct objects differs significantly (p < .0001).[119] The core argument functions are subjects, objects and indirect objects and non-core are the rest of the argument functions, in this case: the group of predicative relations (SP, OP). We find that the core and non-core arguments also differ significantly with respect to the property of person

[117] The logical subject is a relation employed in conjunction with an expletive or formal subject and denotes for instance demoted agents in presentational constructions or impersonal passives. See section 8.3.1 for more on the argument distinctions expressed in Talbanken05.
[118] The high percentage of person referring formal objects is a result of a “quirk” of the annotation, where the reciprocal pronoun själv ‘him/herself’ has been annotated as a formal object in examples like Jag tror själv att . . . ‘I, myself, think that . . . ’
[119] Pearson’s Chi-Squared test with Yates’ continuity correction with 1 degree of freedom over a 2x2 matrix with rows=person,non-person and columns=binary argument distinctions; e.g. subject,object.
134 Acquiring animacy – scaling up Person Non-person Total # % # % # % SS subject 8385 44.8 10349 55.2 18734 100.0 PA prep. compl. 2355 14.4 14035 85.6 16390 100.0 DT determiner 2139 12.2 15330 87.8 17469 100.0 OO dir. obj. 1963 21.2 7281 78.8 9244 100.0 CC conjunct 733 17.9 3373 82.1 4106 100.0 IO indir. obj. 365 87.3 53 12.7 418 100.0 SP subj. pred. 235 11.3 1849 88.7 2084 100.0 AN apposition 130 17.9 596 82.1 726 100.0 HD head of idiom 121 26.8 330 73.2 451 100.0 ET post-nom. mod. 86 27.6 226 72.4 312 100.0 ROOT root 72 11.7 542 88.3 614 100.0 XX unclass. 66 25.0 198 75.0 264 100.0 FO formal obj. 64 44.4 80 55.6 144 100.0 ES logical subj. 60 16.0 315 84.0 375 100.0 KA comp. adv. 58 32.4 121 67.6 179 100.0 +F coord. clause 41 19.7 167 80.3 208 100.0 AA adv. 13 9.3 127 90.7 140 100.0 OA obj. adv. 13 27.1 35 72.9 48 100.0 OP obj. pred. 13 23.6 42 76.4 55 100.0 Table 7.4: Absolute (#) and relative frequencies (%) of nouns and pronouns anno- tated as persons and non-persons in the written sections of Talbanken05, broken down by dependency relation. reference (p<.0000). The core arguments include the indirect object function. Whereas for the subject and object functions some variation is to be expected, indirect objects have been noted to exhibit a strong preference for animate real- ization (Bresnan et al. 2005) and one would expect non-persons to be virtually non-occurring in this relation. However, a closer look at the data indicate that animals, as in (105), collective nouns, as in (106), and organization nouns, as in (107), are in majority among the elements annotated as non-persons in indi- rect object position. There are also some clearly inanimate indirect objects, as in (108) and (109). (105) . . . att . . . that man one ofta often ger gives hunden dog-DEF ett a mål portion mat food per per dag day ‘. . . 
that one often gives the dog one meal per day’ 7.1 Obtaining animacy data 135 (106) TV tv gav gave familjen family-DEF en a ny new samlingspunkt gatheringpoint ‘Television provided the family with a point of union’ (107) Forsmark Forsmark kraftstation powerstation kommer will att to tillföra supply kommunen municipality-DEF ett a avsevärt considerable tillskott increase ‘Forsmark power station will supply the municipality with a considerable increase’ (108) På on gatsten cobble-stone och and betong concrete kan can ett a fulldubbat studded däck tire ge give bilen car-DEF helt totally livsfarliga life-dangerous egenskaper properties ‘A studded tire can, on cobble stone or concrete, provide the car with life-threatening properties’ (109) Idag today försöker tries man on i in regel rule ge give det this här here argumentet argument-DEF en a positiv positive formulering expression ‘Nowadays one usually tries to give this argument a positive expression’ The examples in (105)–(109) illustrate the flexibility of the ditransitive con- struction with respect to its possible arguments. The example in (105) is typical for what one might call the prototypical ditransitive ‘giving’-situation, where an animate agent transfers a concrete object to another animate participant. In (106) the subject is not an agent but rather expresses a causing event (the ac- quiring of a television set). The examples in (108) and (109) also show a more abstract instantiation of the prototypical giving involving inanimate recipients and no sense of transfer at all. Differentiating properties of the arguments may vary with the sense of the dative verb in question, e.g. whether give is em- ployed in a transfer sense, communication sense or abstract sense (Bresnan et al. 2005). However, we may also establish a general trend with respect to properties of the arguments such as animacy, in line with the findings for En- glish. 
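The pairwise significance tests reported above (footnote 119) can be illustrated with a minimal sketch, assuming a plain Pearson chi-squared statistic with Yates' continuity correction over a 2x2 table. The function name and the use of the standard library only are choices made for this illustration, not a description of the software used for the reported tests. The counts are those for subjects (SS) and direct objects (OO) in table 7.4; the statistic is the same whichever of the two dimensions is taken as rows.

```python
import math

def chi2_yates_2x2(a, b, c, d):
    """Chi-squared statistic with Yates' continuity correction for the
    2x2 table [[a, b], [c, d]]; returns (statistic, p-value) for 1 df."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Yates' correction: subtract n/2 from |ad - bc| (floored at zero).
    diff = max(abs(a * d - b * c) - n / 2.0, 0.0)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# (person, non-person) counts from table 7.4
ss_person, ss_nonperson = 8385, 10349   # subjects (SS)
oo_person, oo_nonperson = 1963, 7281    # direct objects (OO)

stat, p = chi2_yates_2x2(ss_person, ss_nonperson, oo_person, oo_nonperson)
print(f"chi2 = {stat:.1f}, p < .0001: {p < 0.0001}")
```

The same function can be applied to any of the binary argument distinctions discussed in this section by summing the relevant rows of table 7.4 into two groups.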
We see from table 7.4 that there is a range of other, non-argument syntactic relations in which person referring nominals occur. In general, person reference is less frequent in these relations, and we find that the argument and non-argument relations differ significantly (p<.0000) with respect to person reference.

Even so, we do find a number of the person referring nominals in non-argument functions, most notably functioning as determiners (DT) or prepositional complements (PA). The person referring determiners turn out to be almost exclusively genitive modifiers, and hence corroborate the correlation between animacy and genitive expression studied in chapter 6 for Norwegian. The reason that these account for such a small proportion of the nominal determiners is that, as noted earlier, nominal pronouns and determiners are assigned the same part-of-speech in Talbanken05.

As table 7.4 indicates, prepositional complements (PA) show a clear preference for non-person reference. Although this tendency is clear in itself, a more detailed and informative picture emerges if we consider the type of prepositional head the nominals in question are governed by. Table 7.5 shows the percentage of person/non-person referring elements among the nominal prepositional complements in Talbanken05, broken down by the most frequent governing prepositions.120

We find that for a majority of the prepositions, there is a strong tendency for non-person complements, and some of these take almost exclusively (99-100%) non-person complements: vid ‘by/next-to’, före ‘before’, utanför ‘outside’, sedan ‘since’, i ‘in’, efter ‘after’, inom ‘inside’, under ‘under’. These are all prepositions which position their complement spatially or temporally. The converse situation is much rarer: only two prepositions, mellan ‘between’ and hos ‘at’, show a stronger tendency for person complements than non-person ones.
The fact that some prepositions show a stronger preference for person-denoting complements is somewhat surprising. For instance, the preposition hos ‘at somebody’s’ is typically used to position a person, but in the Talbanken data it has only 58.8% person complements. However, if we examine the data more closely, we find that most of the complements are actually not typically inanimate even though they are annotated as non-persons. Of the complements of hos which are tagged as non-person, 61.5% denote animals, as in (110), and 23.1% organizations, as in (111). In fact, only 15.4% are actually inanimate, as in (112).

(110) Liknande  förhållanden   finner  man  hos  häckande  måsar
      similar   circumstances  finds   one  at   hatching  seagulls
      ‘One finds similar circumstances among hatching sea gulls’

(111) Hos  försäkringskassan      finns   särskild  broschyr
      at   insurance-company-DEF  exists  special   brochure
      ‘At the insurance company they have a special brochure’

120 Table 7.5 presents only prepositions with an absolute frequency of more than 100 occurrences.
                       Person          Non-person        Total
Preposition               #      %        #      %         #      %
i ‘in’                   31    0.8     3804   99.2      3835  100.0
av ‘by/of’              479   21.7     1732   78.3      2211  100.0
på ‘on’                 184    9.7     1714   90.3      1898  100.0
för ‘for’               463   31.8      991   68.2      1454  100.0
med ‘with’              346   26.0      984   74.0      1330  100.0
till ‘to’               182   14.3     1091   85.7      1273  100.0
om ‘about’               60   10.1      535   89.9       595  100.0
från ‘from’              40    8.9      410   91.1       450  100.0
vid ‘beside’             14    4.1      325   95.9       339  100.0
under ‘under’             2    0.6      312   99.3       314  100.0
mellan ‘between’        154   58.3      110   41.7       264  100.0
mot ‘against’            48   20.6      185   79.4       233  100.0
inom ‘within’             3    1.4      213   98.6       216  100.0
efter ‘after’             3    1.4      212   98.6       215  100.0
enligt ‘following’       50   23.3      165   76.7       215  100.0
genom ‘through’           9    4.6      188   95.4       197  100.0
hos ‘at/among’           90   58.8       63   41.2       153  100.0
utan ‘without’            3    2.3      127   97.7       130  100.0
ur ‘out-of’               4    3.3      118   96.7       122  100.0

Table 7.5: Absolute (#) and relative frequencies (%) of person and non-person noun complements for prepositions; ranked by total, absolute frequency in the written sections of Talbanken05.

(112) Därför     lägger  man  vikt    vid  andra  egenskaper  hos  bilen
      therefore  lays    one  weight  on   other  properties  at   car-DEF
      ‘This is why one emphasizes other properties of the car’

The above examples of indirect objects in (105)–(109) and prepositional complements in (110)–(112) illustrate the fact that the person/non-person distinction as such does not incorporate the more fine-grained distinctions often represented in an animacy hierarchy. The general tendency for the PA dependency relation is thus a strong preference for non-person reference.

Other non-argument relations in which we find person referring nominals include the adverbial relations of comparative adverbials (KA), as in (113), and object adverbials (OA), with 32.4% and 27.1%, respectively. We also find person referring elements (17.9%) among the appositions (AN), as in (114).
(113) Utlänningar  betalar  skatt  som  svenskar
      foreigners   pay      taxes  as   Swedes
      ‘Foreigners pay taxes just like Swedes’

(114) Föräldrarna,  särskilt    fäderna,      ägnar   alltför  . . .
      parents-DEF,  especially  fathers-DEF,  devote  too      . . .
      ‘The parents, especially the fathers, devote too much . . . ’

To summarize, we find that the distribution of person/non-person in the Swedish data instantiates the general pattern of animacy with respect to argument differentiation, as discussed in chapter 3. We also find that person referring nominals are not limited to the argument relations, but also occur as nominal heads of some non-argument relations.

Annotation consistency

In light of our earlier discussion on issues in annotation for animacy, it might be interesting to examine a few properties of the annotation more closely. With respect to annotation of nouns, we may differentiate between a purely denotational (type level) annotation strategy and a purely referential (token level) one. A denotational strategy entails that an element is consistently assigned to only one class. A referential strategy, in contrast, does not impose this restriction on the annotation, hence class assignment may vary depending on the specific context. The brief instruction given in the annotation manual for Talbanken05 (Teleman 1974: 223) gives leeway for interpretation in the annotation. With the general aim of obtaining animacy data for supervised animacy classification, an extraction of person information from Talbanken at the level of noun lemmas will clearly be problematic if there is a lot of variation in class assignment at the level of tokens. We may thus examine the intersection of the two classes for noun lemmas in the written sections of Talbanken, i.e. the set of nouns which have been assigned both classes. It contains 82 noun lemmas, which corresponds to 1.1% of the total number of noun lemmas in Talbanken (7554 lemmas altogether).
This is clearly such a small proportion that it should not be problematic to employ the annotation at the lemma level. After an inspection of the intersective elements, we may group the nouns which were assigned to both classes roughly into the following categories:121

Abstract nouns  These are nouns with underspecified or vague denotational (type-level) properties with respect to animacy, such as quantifying nouns whose reference is determined mostly by context, e.g. hälft ‘half’, miljon ‘million’, nästa ‘next’, as well as other nouns which may be employed with varying animacy, e.g. element ‘element’, fiende ‘enemy’, part ‘party’, as in (115) and (116):

(115) men  det   förutsätter  att   också  den  andra  parten[HH]  står    utanför
      but  that  presupposes  that  also   the  other  party-DEF   stands  outside
      ‘but that presupposes that the other party is also left outside’

(116) I   ett  förhållande   är   aldrig  bägge  parter[_]  lika  starka
      in  a    relationship  are  never   both   parties    same  strong
      ‘In a relationship, both parties are never equally strong’

We also find that nouns which denote abstract concepts regarding humans show variable annotation, e.g. individ ‘individual’, adressat ‘addressee’, medlem ‘member’, kandidat ‘candidate’, representant ‘representative’, auktoritet ‘authority’.

Reference shifting contexts  These are nouns whose denotational animacy is quite clear but which are employed in a specific context which shifts their reference. Examples include metonymic usage of nouns, as in (117), and nouns occurring in dereferencing constructions, such as predicative constructions (118), titles (119) and idioms (120):

(117) Trots    daghemmens[HH]        otillräckliga  resurser   . . .
      despite  kindergarten-DEF.GEN  inadequate     resources  . . .
      ‘Despite the kindergarten’s inadequate resources . . . ’

(118) . . .  för  att  bli     en  bra   soldat[_]
      . . .  for  to   become  a   good  soldier
      ‘. . . in order to become a good soldier’

(119) . . .  menar   biskop[_]  Hellsten
      . . .  thinks  bishop     Hellsten
      ‘thinks bishop Hellsten’

(120) ta    studenten[_]
      take  student-DEF
      ‘graduate from highschool (lit. take the student)’

Annotation errors  There is some variation in annotation which we suspect is due to annotation errors, e.g., (121)–(122) below. We also find instances that were assigned to the wrong lemma due to mistakes in lemmatization (e.g. moder ‘mother’ lemmatized to mod ‘courage’).

(121) han  får  inte  föra  de   direkta  förhandlingarna     på  minst     ett  halvår[HH]
      he   may  not   lead  the  direct   negotiation-DEF.PL  in  at-least  one  half-year
      ‘he is not allowed to lead the direct negotiation for at least another half year’

(122) Om  djup  disharmoni  mellan   föräldrarna[_]  dessutom  äventyrar  barnens        hälsa   . . .
      if  deep  disharmony  between  parents-DEF     also      adventure  child-DEF.GEN  health  . . .
      ‘If a deep disharmony between the parents also jeopardizes the children’s health . . . ’

It is interesting to note that the main variation in annotation stems precisely from difficulties in determining reference, either due to vague denotational properties, as for the abstract nouns, or due to properties of the context, as in the reference shifting constructions.

121 Recall that ‘HH’ is the tag for person referring, whereas the lack of such a tag, ‘_’, denotes a non-person referring element.

7.2 Data preliminaries

This section presents the data sets employed in the scaled up classification experiments. We examine the set of nouns, as well as feature representation and feature extraction, which all constitute important elements in a supervised machine learning experiment.

7.2.1 Talbanken05 nouns

These data sets consist of the noun lemmas with corresponding class (person/non-person) extracted from the Talbanken05 material, detailed above.122 Following the conclusions at the end of section 7.1.2, we here approximate the class of ‘animate’ to ‘person’ and the class of ‘inanimate’ to ‘non-person’.
Table 7.6 provides an overview of the data set resulting from extraction from Talbanken.123 Intersective elements, see section 7.1.2, were assigned to their majority class.124 It is clear that the data is highly skewed towards the non-person class, which accounts for 91.5% of the noun lemmas. We may also note that the type-token ratio differs somewhat for the two classes. Person nouns exhibit less lexical variation than non-person nouns; each person noun type occurs on average nine times, whereas the corresponding figure for non-person nouns is five.

Class        Types   Tokens covered
Animate        644             6010
Inanimate     6910            34822
Total         7554            40832

Table 7.6: The animacy data set from Talbanken05; number of noun lemmas (Types) and tokens in each class.

122 The treebank was lemmatized prior to extraction (Kokkinakis 2001).

7.2.2 Features

In chapter 6 we made use of a set of theoretically motivated, distributional features to represent various aspects of the syntactic properties of the nouns that were classified. In particular, we found that the features encoding subject, direct object and genitive were strong features for animacy classification. Whether or not these features are important also in the current setting remains to be tested empirically. There may also be other features which are important in the scaling to a new, larger set of nouns and a new, although closely related, language. We therefore construct a very general feature space for animacy classification, which makes use of distributional data regarding the general syntactic properties of a noun, as well as various morphological properties. It is clear that in order for a syntactic environment to be relevant for animacy classification it must be, at least potentially, nominal. We define the nominal potential of a dependency relation as the frequency with which it is realized by a nominal element (noun or pronoun) and determine empirically a threshold of 0.10.
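The notion of nominal potential can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (dependency relation, part-of-speech) pairs, the tag names `noun` and `pronoun` and the toy counts are hypothetical, and a relation is kept if its potential is at least 0.10 (whether the threshold is inclusive is not stated in the text).

```python
from collections import Counter

NOMINAL_POS = {"noun", "pronoun"}  # hypothetical tag names
THRESHOLD = 0.10                   # threshold given in the text

def nominal_potential(tokens):
    """Return, for each dependency relation, the share of its tokens
    that are realized by a nominal element (noun or pronoun)."""
    total = Counter()
    nominal = Counter()
    for relation, pos in tokens:
        total[relation] += 1
        if pos in NOMINAL_POS:
            nominal[relation] += 1
    return {rel: nominal[rel] / total[rel] for rel in total}

# Toy data (invented for illustration): SS is mostly nominal, AA rarely so.
toy_tokens = (
    [("SS", "noun")] * 2 + [("SS", "pronoun"), ("SS", "verb")]
    + [("PA", "noun")] * 2
    + [("AA", "noun")] + [("AA", "adverb")] * 19
)

potential = nominal_potential(toy_tokens)
selected = sorted(rel for rel, value in potential.items() if value >= THRESHOLD)
print(potential)
print(selected)
```

In this toy example the adverbial relation falls below the threshold and would not contribute a feature, whereas the subject and prepositional complement relations would.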
The syntactic and morphological features in the general feature space are presented below:

Syntactic features  A feature for each dependency relation with nominal potential: (transitive) subject (SUBJ),125 object (OBJ), prepositional complement (PA), root (ROOT),126 apposition (APP), conjunct (CC), determiner (DET), predicative (PRD), complement of comparative subjunction (UK). We also include a feature for the complement of a genitive modifier, the so-called ‘possessee’ (GENHD).

Morphological features  A feature for each morphological distinction relevant for a noun: gender (NEU/UTR), number (SIN/PLU), definiteness (DEF/IND), case (NOM/GEN). Also, the part-of-speech tags distinguish dates (DAT) and quantifying nouns (SET), e.g. del, rad ‘part, row’, so these are also included as features.

123 Note that the figures in table 7.6 differ from those presented in table 7.2 above, as the current data set only contains common nouns, not proper names.
124 When there is no majority class, i.e. in the case of ties, the noun was removed from the data set. 12 lemmas were consequently removed from the data set.
125 An element is a transitive subject if it has a direct object sibling.
126 Nominal elements may be assigned the root relation in sentence fragments which do not include a finite verb.

7.2.3 Feature extraction

In chapter 6, the distributional data for the individual noun lemmas was extracted from a fairly large, automatically parsed corpus of Norwegian. For extraction of distributional data for the set of Swedish nouns we make use of the Swedish Parole corpus, see section 5.1.2. To facilitate feature extraction, we part-of-speech tag the corpus and parse it with MaltParser, which assigns a dependency analysis.127

127 For part-of-speech tagging, we employ the MaltTagger – an HMM part-of-speech tagger for Swedish (Hall 2003). The pretrained model for Swedish employs the SUC tagset (http://spraakbanken.gu.se/parole/tags.phtml). For parsing, we employ MaltParser, see section 5.3.1, with the pretrained model for Swedish, which has been trained on the SUC-tags output by the tagger. It makes use of a smaller set of dependency relations than those found in Talbanken05.

                           Animate         Inanimate
                           Mean    SD      Mean    SD
Syntactic
  SUBJ                     0.21  0.12      0.08  0.07
  OBJ                      0.13  0.07      0.19  0.13
  PA                       0.21  0.10      0.40  0.18
  ROOT                     0.03  0.03      0.03  0.05
  APP                      0.03  0.03      0.01  0.03
  CC                       0.12  0.08      0.09  0.08
  DET                      0.12  0.16      0.04  0.09
  PRD                      0.07  0.08      0.04  0.06
  UK                       0.04  0.05      0.01  0.03
  GENHD                    0.03  0.05      0.04  0.06
Morphological
  gender        NEU        0.05  0.21      0.29  0.45
                UTR        0.95  0.21      0.71  0.45
  number        SIN        0.51  0.34      0.75  0.30
                PLU        0.48  0.34      0.24  0.29
  definiteness  DEF        0.34  0.24      0.33  0.25
                IND        0.66  0.24      0.66  0.25
  case          NOM        0.93  0.17      0.96  0.12
                GEN        0.07  0.17      0.03  0.09
  date          DAT        0.00  0.00      0.01  0.07
  set           SET        0.00  0.00      0.01  0.08

Table 7.7: Mean relative frequencies and standard deviation for each feature by class following feature extraction from Parole for nouns of absolute frequencies >10.

Table 7.7 shows an overview of the aggregated mean values, along with standard deviations, from the Parole corpus for each class of Talbanken05 noun (Animate or Inanimate), broken down by the various features. Despite the fact that these values are from a noisy, automatically annotated corpus, we observe many of the same tendencies as noted in the treebank material discussed earlier. We find clear distributional differences between the classes in a range of syntactic relations, most notably in argument positions (SUBJ, OBJ), as prepositional complement (PA) etc. For the extraction of the SUBJ and OBJ features in chapter 6, we took advantage of the containment of ambiguity which characterizes Constraint Grammar analysis, see section 5.1.3, and extracted only subjects and objects which were structurally unambiguous. The data extraction for Swedish is in this respect more noisy, since the dependency analysis does not indicate structural ambiguity.
It is interesting to note that the tendencies are still very similar despite the noise of the data.

With respect to morphological properties, we observe differences in distribution with respect to gender and number, where animates have a stronger preference for non-neuter gender (UTR) than inanimates, and, conversely, inanimate nouns exhibit a stronger preference for neuter gender than animate nouns. With respect to number, we, somewhat surprisingly, note that there is a stronger preference for singular number for inanimate nouns than animate, and the converse with respect to plurality. However, these features exhibit a high degree of variation and we find that certain nouns which almost exclusively occur in singular or plural affect these aggregated results. For instance, abstract inanimate nouns like död ‘death’ or ansvar ‘responsibility’ occur exclusively in the singular. Recall, however, that the feature representations of the nouns consist of normalized counts for that specific noun and not the aggregated means for each class as presented in table 7.7. Any lexical preferences with respect to morphology are thus properties of the individual nouns supplied to the classifier.

[Figure 8: Rank frequency profile of all Parole nouns (axes: rank, freq).]

In chapter 6, we examined the effect of sparse data on classification. It is to be expected that the problem of sparse data becomes more severe as we attempt to scale up the animacy classification. The rank/frequency profile of common nouns in Parole is illustrated in figure 8.128 It shows a Zipfian curve which is typical of word frequencies in natural language, where a few noun lemmas are highly frequent and an increasing number of lemmas have lower frequencies. The greatest number of lemmas, as illustrated by the “tail” in figure 8, occur only once, so-called hapax legomena.
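A rank/frequency profile of the kind described above can be computed from a list of lemma tokens as in the following minimal sketch. The function names and the toy lemma list are invented for illustration and are not taken from the Parole corpus; plotting (with a logarithmic frequency axis) is left out.

```python
from collections import Counter

def rank_frequency_profile(lemmas):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent type."""
    frequencies = sorted(Counter(lemmas).values(), reverse=True)
    return list(enumerate(frequencies, start=1))

def hapax_count(lemmas):
    """Number of types that occur exactly once (hapax legomena)."""
    return sum(1 for freq in Counter(lemmas).values() if freq == 1)

# Toy token list (hypothetical): a few frequent types and a tail of hapaxes.
toy_lemmas = ["år"] * 5 + ["människa"] * 3 + ["barn"] * 3 + ["hund", "kassa", "mås"]

profile = rank_frequency_profile(toy_lemmas)
print(profile)
print(hapax_count(toy_lemmas))
```

Even in this tiny example the shape is the familiar one: few types at the top ranks carry most of the tokens, and the types in the tail occur once each.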
128 A rank/frequency profile illustrates the token frequencies of the ranked types. The frequency is plotted on a logarithmic scale, since there is such a large discrepancy between the token frequencies of the top ranked types, compared to the lower ranked ones (Baroni 2007).

In the experiments in chapter 6, we sorted the data into various frequency bins in order to examine the effect of sparse data on the classification performance. In table 7.8 we see the nouns from Talbanken05 organized into frequency bins by their absolute frequencies in the Parole corpus. For both animate and inanimate nouns, we find the same general tendency illustrated by the rank/frequency profile, indicated by an increasing number of types with lower frequencies. We observe that roughly 27% of the Talbanken05 noun lemmas (2073 of 7554) do not occur at all in the Parole corpus, and hence will not be included in the data sets for classification.

                     Animate         Inanimate       Total
Bin      Freq          #      %         #      %         #
∼1000    >1000        31    4.8       260    3.8       291
∼500     999-500      35    5.4       271    3.9       306
∼100     499-100      92   14.3       979   14.2      1071
∼50      99-50        57    8.9       553    8.0       610
∼10      49-10       132   20.5      1376   19.9      1508
∼1       9-1         132   20.5      1563   22.6      1695
0        0           165   25.6      1908   27.6      2073
Total                644  100.0      6910  100.0      7554

Table 7.8: Animate and inanimate Talbanken05 nouns in frequency bins by Parole frequency.

Since the main focus of the current chapter is to scale up the animacy classification to realistic data sets, we mostly employ data sets consisting of accumulated frequency bins, which include all nouns with frequencies above a certain threshold. The data organized into accumulated frequency bins is presented in table 7.9.

            Animate        Inanimate       Total
Bin           #      %        #      %        #      %
>1000        31   10.7      260   89.3      291  100.0
>500         66   11.1      531   88.9      597  100.0
>100        158    9.5     1510   90.5     1668  100.0
>50         215    9.4     2063   90.6     2278  100.0
>10         347    9.2     3439   90.8     3786  100.0
>0          479    8.7     5002   91.3     5481  100.0

Table 7.9: Animate and inanimate Talbanken05 nouns in accumulated frequency bins by Parole frequency.
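The relation between the individual bins of table 7.8 and the accumulated bins of table 7.9 is a running sum over the bins, from the most frequent downwards. The sketch below, with variable names chosen for this illustration, recomputes the accumulated counts from the counts in table 7.8 (leaving out the bin of nouns that do not occur in Parole) together with the share of inanimate nouns in each accumulated data set, which is the accuracy a majority-class baseline would obtain.

```python
from itertools import accumulate

# Counts per individual frequency bin, taken from table 7.8 (bin 0 excluded).
thresholds = [">1000", ">500", ">100", ">50", ">10", ">0"]
animate = [31, 35, 92, 57, 132, 132]
inanimate = [260, 271, 979, 553, 1376, 1563]

# Accumulated bins: each data set contains all nouns above the threshold.
acc_animate = list(accumulate(animate))
acc_inanimate = list(accumulate(inanimate))

# Share of the majority class (inanimate) in each accumulated data set.
majority_share = [
    100.0 * i / (a + i) for a, i in zip(acc_animate, acc_inanimate)
]

for label, a, i, share in zip(thresholds, acc_animate, acc_inanimate, majority_share):
    print(f"{label:>6}  animate={a:4d}  inanimate={i:5d}  majority={share:.1f}%")
```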
7.3 Experiments

The experiments described in the following address some important issues in the scaling up of our method from chapter 6. In particular, we discuss how the skewed distribution of classes, as noted above, affects the results and examine the interaction with the additional, complicating factor of data sparseness. The overall focus will be on locating features which are stable class predictors across different machine learners and for data sets of varying properties.

7.3.1 Experimental methodology

In chapter 6, we compared eager and lazy machine learning algorithms for the task of animacy classification. We looked at the use of decision trees acquired with the eager C4.5 algorithm (Quinlan 1993) and compared it with memory-based learning, which employs lazy learning with the k-nearest neighbor algorithm. In section 6.5 we did not find any statistically significant differences between the two learning algorithms and conjectured that the size of the data set influenced the measure of significance conservatively. In the current chapter we have a considerably larger data set, hence we may once again compare performance between the two machine learning algorithms. In the following experiments we will continue to employ both types of learners and contrast the two wherever appropriate. For decision tree learning, we employ C5.0 with boosted classifiers, see section 5.2.1, and for memory-based learning we employ TiMBL, see section 5.2.2, with the basic settings, resulting from the parameter optimization described in section 6.5.2, unless otherwise stated.129 For training and testing of the classifiers, we make use of leave-one-out cross-validation. The baseline represents assignment of the majority class (inanimate) to all nouns in the data set. Due to the skewed distribution of classes, as noted above, the baseline accuracy is very high, usually around 90%.
Clearly, however, the class-based measures of precision and recall, as well as the combined F-score measure, are more informative for these results. The baseline F-score for the animate class is thus 0.0, and a main goal is to improve on the rate of true positives for animates, while limiting the trade-off in terms of performance for the majority class of inanimates, which start out with F-scores approaching 100. For calculation of the statistical significance of differences in the performance of classifiers tested on the same data set, McNemar’s test (Dietterich 1998) is employed.

129 Recall that the basic settings correspond to k = 1 with no feature weighting.

7.3.2 Original features

The experiments on Norwegian in chapter 6 showed that the three features subject, object and genitive case were the most robust features. Table 7.10 shows the results from classification of the Talbanken05 nouns with the distributional features SUBJ, OBJ and GEN extracted from Parole, as described in sections 7.2.2–7.2.3 above. The experiments were run on accumulated frequency bins, where each data set contains all data instances of higher frequencies, e.g. the >50 data set contains all nouns of frequencies higher than 50.

Bin       Baseline    MBL    DecTree
>1000         89.3   89.0       90.7
>500          88.9   90.3       93.3
>100          90.5   89.8       93.7
>50           90.6   89.4       93.3
>10           90.8   89.0       92.2
>0            91.3   90.0       92.1

Table 7.10: Accuracy for MBL and DecTree classifiers with the original feature set (SUBJ, OBJ, GEN) on Talbanken05 nouns in accumulated frequency bins.

We observe a clear difference between the results for the lazy (MBL) and eager (DecTree) machine learners. The performance of the MBL-classifier is never significantly better than the baseline and for the >100, >50, >10 and >0 data sets, the performance is in fact significantly worse than the baseline.
The DecTree-classifier, in contrast, performs significantly better than the baseline on all data sets with frequency thresholds below 1000.130

Tables 7.11 and 7.12 show the experimental results relative to class for the lazy and eager learner, respectively. For both classifiers, we find that the performance for the inanimate class is fairly stable, whereas the performance for the animate class deteriorates as more infrequent nouns are added to the data set. We find that the performance for the animate class is quite low (with F-scores varying between 66.1 at best and 31.2 at worst), regardless of learner, and performance is clearly affected by the frequency of the data instances. If we compare the class results for the two learners, we find that the main difference is found in a better animate precision and inanimate recall for the DecTree-classifier. These are clearly advantageous properties in dealing with the skewed class distribution and counteracting overgeneralization from the less frequent class.

130 The decision tree classifier does not differ significantly from the baseline for the >1000 data set, but differs significantly from the baseline at the p<.001-level for the >500, >10 and >0 data sets, and at the p<.0001-level for the >100 and >50 data sets.

MBL       Animate                        Inanimate
          Precision  Recall  Fscore     Precision  Recall  Fscore
>1000          48.5    51.6    50.0          94.2    93.5    93.8
>500           56.1    56.1    56.1          94.5    94.5    94.5
>100           46.3    47.5    46.9          94.5    94.2    94.4
>50            43.7    42.0    42.8          94.0    94.4    94.2
>10            38.8    34.3    36.4          93.4    94.5    94.0
>0             39.4    26.5    31.7          93.2    96.1    94.6

Table 7.11: Precision, recall and F-scores for the two classes in MBL-experiments with original features (SUBJ, OBJ, GEN).
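The class-based measures and the significance test used in this section can be made concrete with a small sketch. It assumes the standard definitions of precision, recall and F-score, and McNemar's test in the continuity-corrected form given by Dietterich (1998), here floored at zero. The animate counts for the >1000 data set (16 true positives, 17 false positives, 15 false negatives) are not reported in the text; they are reconstructed here as one set of counts consistent with the MBL row of table 7.11, and the discordant counts passed to the test are purely hypothetical.

```python
import math

def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def mcnemar(only_a_wrong, only_b_wrong):
    """McNemar's test with continuity correction for two classifiers tested on
    the same items; arguments are the two discordant counts. Returns
    (statistic, p-value) for 1 degree of freedom."""
    discordant = only_a_wrong + only_b_wrong
    if discordant == 0:
        return 0.0, 1.0
    stat = max(abs(only_a_wrong - only_b_wrong) - 1, 0) ** 2 / discordant
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Majority baseline on the >1000 data set: no noun is labelled animate.
baseline = precision_recall_f(tp=0, fp=0, fn=31)

# Reconstructed animate counts for MBL on the >1000 data set (assumption).
mbl = precision_recall_f(tp=16, fp=17, fn=15)

# Hypothetical discordant counts for two classifiers on the same data set.
stat, p = mcnemar(only_a_wrong=30, only_b_wrong=12)

print(baseline)
print([round(100 * value, 1) for value in mbl])
print(round(stat, 2), p < 0.01)
```

Only the items on which the two classifiers disagree enter McNemar's test, which is why two classifiers with similar accuracies can still differ significantly.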
DecTree   Animate                      Inanimate
          Precision  Recall  Fscore    Precision  Recall  Fscore
>1000     57.1       51.6    54.2      94.3       95.4    94.8
>500      75.0       59.1    66.1      95.0       97.6    96.3
>100      79.8       44.9    57.5      94.5       98.8    96.6
>50       72.8       46.0    56.4      94.6       98.2    96.4
>10       67.6       28.2    39.8      93.2       98.6    95.8
>0        65.8       20.5    31.2      92.9       99.0    95.8

Table 7.12: Precision, recall and F-scores for the two classes in DecTree experiments with original features (SUBJ, OBJ, GEN).

7.3.2.1 Uniform distribution

Based on the results from the experiments with original features, it is difficult to say anything about the general applicability of these features (SUBJ, OBJ, GEN) to the Swedish nouns. This is mainly due to the fact that the data exhibits a very skewed distribution of classes, hence training data for the animate class is limited. In order to test the generalizability of the original distributional features further and tease apart the influence of a skewed class distribution from that of data sparseness, an additional experiment is performed on data sets with a uniform distribution of classes.131

131 The uniform data sets are constructed from all animate instances in a data set and the corresponding proportion of randomly selected inanimate instances, balanced with respect to absolute frequencies. This technique for dealing with skewed data sets is known in the machine learning literature as ‘down-sampling’ and denotes the removal of instances of the majority class for training, see, e.g., Hoste 2005.

Table 7.13 shows the results for the uniformly distributed data sets and contrasts eager and lazy learning, as before. We observe a clear reduction in error rate (80.6%-37.2%) for all classifiers compared to a random baseline. As the F-scores for each class in table 7.14 illustrate, the uniform distribution of classes gives balanced results for the individual classes as well. This shows that the motivated, robust features identified in the previous chapter are good class predictors also for Swedish and larger sets of naturally occurring nouns. It is also clear, as discussed in chapter 6, that data sparseness has a clear effect on the results, regardless of the class distribution. Results deteriorate gradually as more infrequent nouns are added, from accuracies of 90.3 for the >1000Uni data set to an average of 70.4 for the >0Uni data set.

Bin        Baseline  MBL   DecTree
>1000Uni   50.0      90.3  90.3
>500Uni    50.0      88.6  83.3
>100Uni    50.0      79.1  82.0
>50Uni     50.0      78.6  83.3
>10Uni     50.0      72.8  76.8
>0Uni      50.0      68.6  72.2

Table 7.13: Accuracy for MBL and DecTree learners with the original feature set (SUBJ, OBJ, GEN) on Talbanken05 nouns with uniform class distribution in accumulated frequency bins.

           MBL                  DecTree
           Animate  Inanimate   Animate  Inanimate
>1000Uni   90.3     90.3        90.0     90.6
>500Uni    88.7     88.5        84.1     82.5
>100Uni    78.8     79.4        81.7     82.2
>50Uni     78.3     78.9        83.0     83.5
>10Uni     72.8     72.7        73.3     79.5
>0Uni      67.0     70.0        68.6     75.1

Table 7.14: F-scores for the two classes in experiments with original features (SUBJ, OBJ, GEN) and uniform class distribution.

In the previous section we observed a difference in performance between the lazy and eager learner. In the present experiment, we find significant differences only for the data sets >50Uni, >10Uni and >0Uni.132 This indicates that it is the ability to deal with data sparseness which is the main source of difference between the two, rather than the skewed distribution of data. In chapter 6 we noted the opposite effect, but without enough noun instances to establish significance. We find that data sparseness is better dealt with by the decision tree learner, given that there is sufficient data to generalize over.

132 The performance of the decision tree classifier differs significantly from that of the MBL-learner at the p<.05 level for the >50Uni and >10Uni data sets, and at the p<.01 level for the >0Uni data set.

We must differentiate between the size of the data set and the sparsity of the data set. Table 7.15 illustrates this point further, showing results for experiments which are run on individual frequency bins, rather than accumulated ones. This provides an identical setting to the experiments on lower frequency nouns in chapter 6. These data sets are thus considerably smaller than their accumulated counterparts and once again, we find no significant differences between the two classifiers. The conclusion is therefore that decision trees perform better than MBL over sparse instances, provided that the data set is large enough. We must, however, note that the notion of similarity embodied in the MBL-settings is not updated to take into account a larger data set, which gives a somewhat unfair comparison, a point to which we return in the next section.

Bin        Baseline  MBL   DecTree
∼1000Uni   50.0      90.3  90.3
∼500Uni    50.0      88.6  88.6
∼100Uni    50.0      79.1  81.6
∼50Uni     50.0      78.6  79.8
∼10Uni     50.0      72.8  71.2
∼1Uni      50.0      68.6  60.0

Table 7.15: Accuracy for MBL and DecTree classifiers with the original feature set (SUBJ, OBJ, GEN) on Talbanken05 nouns with uniform class distribution in individual frequency bins.

We may conclude from the above that both data sparseness and skewed class distribution are serious issues in the scaling up of our classification task. We find that the skewed distribution causes an unbalanced result for the lower frequency class of animate nouns. We also observe the general detrimental tendency of sparse data, regardless of class distribution and size of data set.
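The down-sampling used to construct the uniform data sets (footnote 131) can be illustrated with a minimal sketch. It keeps all instances of the minority class and a random sample of equal size from the majority class. The function name, the `label_of` accessor and the toy data are assumptions made for illustration only, and the sketch does not reproduce the balancing with respect to absolute frequencies described in the footnote.

```python
import random
from collections import Counter

def down_sample(instances, label_of, seed=0):
    """Return a class-balanced subset by randomly dropping majority-class instances."""
    rng = random.Random(seed)
    by_class = {}
    for inst in instances:
        by_class.setdefault(label_of(inst), []).append(inst)
    # Size of the smallest class determines how many instances each class keeps
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical, heavily skewed toy data: (noun, class) pairs
nouns = [("n%d" % i, "inan") for i in range(90)] + [("a%d" % i, "anim") for i in range(10)]
balanced = down_sample(nouns, label_of=lambda pair: pair[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

With a uniform class distribution of this kind, a random baseline is at 50% accuracy, which is the baseline reported in tables 7.13 and 7.15.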
We find that it may be partially counteracted by the size of the data set; however, there is clearly room for improvement. In dealing with more infrequent nouns it is clear that the three features employed above do not provide sufficient class discrimination. In the following we will therefore investigate some strategies to obtain more informed learners. In particular, we examine an extended feature space, as well as optimizing the notion of similarity employed during classification.

7.3.3 General feature space

The general feature space described in 7.2 above gives more distributional data for each individual noun. This can be an advantage in the light of the skewed distribution and data sparseness discussed above, since it enables a more informed measure of similarity between the instances. However, whether the more general feature space captures generalizations which are relevant to the animacy dimension is a question which has to be tested empirically.

Prior to the experiments, the TiMBL settings were optimized on a subset of the full data set, giving us a set of optimized MBLopt classifiers.133 The parameter optimization shows that a larger set of nearest neighbors, as well as feature weighting and weighted class voting provide for better generalizations over the data. All of these parameters contribute to a more discriminating notion of similarity, which is an important factor in successfully exploiting the information contained in an enlarged feature space, as well as in dealing with the earlier mentioned skewed distribution and data sparseness. Table 7.16 shows the accuracy obtained with all features in the general feature space. We find significant improvements compared to the baseline for all data sets except the >0 data set, where performance for the unoptimized, lazy learner (MBL) is at baseline.
The DecTree and MBLopt classifiers are clearly superior to the unoptimized MBL classifier, and we therefore focus on these in the following.134 We observe a clear improvement with the general feature space compared to the baseline. The improvement of the DecTree classifier over the baseline on the >1000 data set is significant at the p<.01 level, whereas the MBLopt-classifier differs at the p<.001 level on this same data set. Performance on all the other data sets shows a highly significant reduction of errors (p<.0001) for both classifiers. As we recall, the data sets are successively larger, hence it seems fair to conclude that the size of the data set partially counteracts the lower frequency of the test nouns. It is not surprising, however, that a method based on distributional features suffers when the absolute frequencies approach 1.

133 For parameter optimization we employ the paramsearch tool, supplied with TiMBL, see http://ilk.uvt.nl/software.html. paramsearch implements a hill climbing search for the optimal settings on iteratively larger parts of the supplied data. We performed parameter optimization on 20% of the total >0 data set, where we balanced the data with respect to frequency, concatenating equal proportions from each respective frequency bin. The resulting settings are k = 11, GainRatio feature weighting and Inverse Linear (IL) class voting weights.

134 Differences between the two MBL-classifiers with general features are significant for all data sets.

Bin     Baseline  MBL   DecTree  MBLopt
>1000   89.3      96.2  94.5     97.3
>500    88.9      95.1  96.1     97.3
>100    90.5      95.6  96.6     96.8
>50     90.6      94.8  95.7     96.1
>10     90.8      93.1  94.6     95.4
>0      91.3      91.9  93.9     93.9

Table 7.16: Accuracy for MBL and DecTree classifiers with a general feature space on Talbanken05 nouns in accumulated frequency bins.

DecTree   Animate                      Inanimate
          Precision  Recall  Fscore    Precision  Recall  Fscore
>1000     82.6       61.3    70.4      95.5       98.5    97.0
>500      86.4       77.3    81.6      97.2       98.5    97.8
>100      89.1       72.8    80.1      97.2       99.1    98.1
>50       87.3       64.2    74.0      96.4       99.0    97.7
>10       76.8       59.1    66.8      96.0       98.2    97.1
>0        79.8       40.5    53.7      94.6       99.0    96.7

Table 7.17: Precision, recall and F-scores for the two classes in DecTree experiments with a general feature space.

MBLopt    Animate                      Inanimate
          Precision  Recall  Fscore    Precision  Recall  Fscore
>1000     89.7       83.9    86.7      98.1       98.8    98.5
>500      89.1       86.4    87.7      98.3       98.7    98.5
>100      87.7       76.6    81.8      97.6       98.9    98.2
>50       85.8       70.2    77.2      97.0       98.9    97.9
>10       81.9       64.0    71.8      96.4       98.6    97.5
>0        75.7       44.9    56.4      94.9       98.6    96.7

Table 7.18: Precision, recall and F-scores for the two classes in MBLopt experiments with a general feature space.

Tables 7.17-7.18 present the experimental results relative to class. We find that, as noted earlier in chapter 6, it is largely the animate class which suffers from the addition of lower frequency nouns. Even so, the classification of animate instances is notably improved compared to the experiment with original features. We also find that the performance for the inanimate class is quite stable throughout the experiments (MBLopt F-scores ranging from 98.5 to 96.7), a fact which is important since these are in clear majority in the data set.

The MBLopt-classifier consistently performs slightly better than the DecTree classifier, although only differences for the >1000 and >10 data sets are significant.

Bin     Baseline  MBL-Morphopt  MBL-Syntaxopt
>1000   89.3      89.7          94.5
>500    88.9      90.5          95.1
>100    90.5      91.5          95.5
>50     90.6      91.3          95.0
>10     90.8      91.1          94.0
>0      91.3      91.5          93.2

Table 7.19: Accuracy for MBLopt-classifiers with feature subsets on Talbanken05 nouns in accumulated frequency bins.
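To make the optimized notion of similarity more concrete, the following is a simplified sketch of nearest-neighbor classification with feature weights and inverse-linear class voting, the combination reported in footnote 133 (k = 11). It is not TiMBL's implementation: the distance is a plain weighted city-block distance, the feature weights are supplied by the caller rather than computed as GainRatio, and all names and toy values are assumptions for illustration.

```python
from collections import defaultdict

def inverse_linear_weights(distances):
    """Voting weights for ascending distances: nearest gets 1, farthest gets 0."""
    d_near, d_far = distances[0], distances[-1]
    if d_far == d_near:
        return [1.0] * len(distances)
    return [(d_far - d) / (d_far - d_near) for d in distances]

def classify(query, training, feature_weights, k=11):
    """training is a list of (vector, label) pairs; returns the label with most weighted votes."""
    scored = sorted(
        (sum(w * abs(q - x) for w, q, x in zip(feature_weights, query, vec)), label)
        for vec, label in training
    )[:k]
    weights = inverse_linear_weights([d for d, _ in scored])
    votes = defaultdict(float)
    for (_, label), weight in zip(scored, weights):
        votes[label] += weight
    return max(votes, key=votes.get)

# Hypothetical nouns described by relative frequencies of (SUBJ, OBJ, GEN)
training = [
    ((0.50, 0.10, 0.20), "anim"),
    ((0.45, 0.15, 0.10), "anim"),
    ((0.10, 0.40, 0.00), "inan"),
    ((0.05, 0.50, 0.00), "inan"),
    ((0.12, 0.30, 0.01), "inan"),
]
label = classify((0.40, 0.10, 0.15), training, feature_weights=(1.0, 1.0, 1.0), k=3)
print(label)
```

The effect of the voting weights is that close neighbors dominate the decision, which matters when a larger k is used on a skewed data set where most neighbors belong to the majority class.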
7.3.4 Feature analysis

Unlike the experiments reported in chapter 6, the features employed for representation of the nouns in the general feature space are not all linguistically motivated indicators of an animacy distinction. The above experiments, however, indicate that these features provide important clues for the animacy of nouns. In the following we analyze the influence of the various features from different perspectives.

7.3.4.1 Feature subsets – syntax vs. morphology

The general feature space consists of both syntactic and morphological features and the above experiments have indicated the importance of both of these feature types. The extent to which it is morphology or syntax which is most important in ascertaining animacy, however, is not clear. One way of contrasting the importance of syntactic and morphological distribution in determining animacy is to run classification experiments with feature subsets of syntactic and morphological features.

In order to test the influence of syntactic versus morphological features we trained and optimized MBL classifiers for each of these feature subsets, as defined in section 7.2.2 and summarized in table 7.7 above. The overall results in terms of accuracy are presented in table 7.19.

The results clearly indicate that the syntactic features are the strongest indicators of animacy. The classifiers employing only morphological features perform around baseline or slightly above (p<.05) for the >100–>0 data sets. It is clear that the increased size of these data sets enables the acquisition of generalizations regarding morphological clues for animacy, but these are clearly not sufficient.
The classifiers employing syntactic features perform notably better on their own, with all differences from the baseline significant (p<.0001).135 Even so, the necessity of both types of features is also corroborated by the results – the syntactic classifiers never outperform the classifiers combining morphological and syntactic evidence, all of which perform significantly better for all data sets.

7.3.4.2 Decision tree

An advantage of decision tree learning is that the result of learning provides a generalization over the data set which may be inspected. A decision tree consists of a set of weighted, disjunctive tests which at each node in the tree assigns an appropriate test to an input, and which proceeds along one of its branches, representing possible outcomes of the test. All features are usually not employed in the tree, since smaller trees are preferred and the tree is pruned prior to application, see section 5.2.1 for more details. As an indicator of feature importance we may therefore examine the decision trees in a bit more detail.

Figure 9 presents the decision tree constructed for the >100 data set.136 The disjunctive tests applied at each step are of the form 〈attr Test value〉, where attr is a feature, value is a possible value of that feature and Test is the test operator. In this case all values are numerical and the operators are the binary numeric operators <, = and >. Each terminal node of the tree represents an assigned class, and information regarding the correct/incorrect ratio of instances covered by that particular node is provided in the example tree.

The decision tree in figure 9 embodies generalizations observed in several places above. We find that the subject feature partitions the data set initially, with a cut-off of approximately 0.14. In fact, all the decision trees for the various accumulated data sets employ the subject feature for initial partitioning.
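The way in which tests of the form 〈attr Test value〉 are applied can be sketched in a few lines. The fragment below encodes only the tests on lines 14-19 of the tree in figure 9, using the thresholds and classes given there; the data structure, the function name and the example nouns are assumptions made for illustration and do not reflect how the decision tree learner stores its trees.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}

# A node is either a class label (str) or a list of (attr, op, value, subtree) branches.
# Only lines 14-19 of figure 9 are encoded; other branches are left out.
FRAGMENT = [
    ("subj", ">", 0.1374663, [
        ("uk", "<=", 0.008849557, "inan"),
        ("uk", ">", 0.008849557, [
            ("prep", ">", 0.3342618, [
                ("sin", "<=", 0.1395349, "anim"),
                ("sin", ">", 0.1395349, "inan"),
            ]),
        ]),
    ]),
]

def classify(noun, node=FRAGMENT):
    """Follow the first branch whose test holds; None if the fragment has no such branch."""
    if isinstance(node, str):
        return node
    for attr, op, value, subtree in node:
        if OPS[op](noun[attr], value):
            return classify(noun, subtree)
    return None

# Hypothetical feature vectors (relative frequencies)
frequent_subject = {"subj": 0.30, "uk": 0.00, "prep": 0.10, "sin": 0.50}
print(classify(frequent_subject))
```

The first example noun is assigned the inanimate class by the test on line 15, the node discussed in the text for nouns with many subject occurrences but few occurrences as predicative modifiers.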
The largest branch (lines 1-13) is characterized by instances with lower proportions of subject occurrences and is dominated by inanimate terminals. In the same branch, a higher proportion of objects is employed to ascertain the inanimate class (line 5) and vice versa (line 6). We noted earlier the distributional asymmetry with respect to prepositional complementation and we find that this generalization is also represented in the decision tree, where we find the majority of animate instances in the subtree dominated by a test for lower proportions of this construction (line 20). The earlier mentioned preference for genitive case is present through a restriction on the proportion of nominative occurrences (lines 21-38), which is mutually exclusive with the genitive.

135 To be precise, the difference from the baseline for the >1000 data set is at the p<.05 level, most likely due to the small size of the data set.

136 The decision tree in figure 9 was constructed over the entire data set and is in this respect an idealization. Minor variations of this tree were actually employed in the experiments, since we applied leave-one-out cross validation.

 1 subj <= 0.1374663:
 2 :...app > 0.04225352:
 3 :   :...prd <= 0.06315789: inan (19)
 4 :   :   prd > 0.06315789:
 5 :   :   :...obj <= 0.2116788: anim (10/1)
 6 :   :       obj > 0.2116788: inan (5)
 7 :   app <= 0.04225352:
 8 :   :...cc <= 0.1607717: inan (1176/7)
 9 :       cc > 0.1607717:
10 :       :...subj <= 0.1105528: inan (108/4)
11 :           subj > 0.1105528:
12 :           :...ind <= 0.6509434: inan (3)
13 :               ind > 0.6509434: anim (5)
14 subj > 0.1374663:
15 :...uk <= 0.008849557: inan (135/8)
16     uk > 0.008849557:
17     :...prep > 0.3342618:
18         :...sin <= 0.1395349: anim (3)
19         :   sin > 0.1395349: inan (43/1)
20         prep <= 0.3342618:
21         :...nom > 0.9925373: inan (18/4)
22             nom <= 0.9925373:
23             :...app > 0.02214452: anim (34)
24                 app <= 0.02214452:
25                 :...subj > 0.221519:
26                     :...uk > 0.01548673: anim (47)
27                     :   uk <= 0.01548673:
28                     :   :...neu <= 0.0001496558: anim (12/3)
29                     :       neu > 0.0001496558: inan (4)
30                     subj <= 0.221519:
31                     :...root > 0.04166667: inan (4)
32                         root <= 0.04166667:
33                         :...nom <= 0.9231928:
34                             :...sin <= 0.1753731: anim (2)
35                             :   sin > 0.1753731: inan (8)
36                             nom > 0.9231928:
37                             :...app <= 0.02048417: anim (29/4)
38                                 app > 0.02048417: inan (3)

Figure 9: Decision tree acquired for the >100 data set in experiments with a general feature space.

We also find some predictive environments which have not been studied in detail earlier. This is partially due to the fact that the parse model employed to parse Parole makes use of a slightly different tag set than the one found in Talbanken05, the subject of our study of animacy in section 7.1.2 above. For instance, the UK-tag is employed for predicative modifiers, as in (123) and (124), and the APP-tag is employed for appositions, as in (125), all taken from the Parole corpus:

(123) Vi  jobbar  som  barnflickor  och  ...
      we  work    as   nannies      and  ...
      ‘We work as nannies ...’

(124) Han  ser    ut   som   en  dansk   murare
      he   looks  out  like  a   Danish  mason
      ‘He looks like a Danish mason’

(125) Hannes  Sköld,  stiftaren,   var  knäckt      ...
      Hannes  Sköld,  founder-DEF  was  devastated  ...
      ‘Hannes Sköld, the founder, was devastated ...’

We observe that a lower value for the UK feature directly determines the inanimate class (line 15) for a set of nouns with a higher proportion of subject occurrences, but which are still predominantly inanimate. The classification of these instances has an accuracy of 94%. If we examine the classified instances, we find predominantly non-concrete, inanimate nouns like förslag ‘suggestions’, råd ‘advice’, studie ‘study’, utredning ‘investigation’, as well as a group of collective and organization nouns (31.1% of the instances), such as förening ‘association’, grupp ‘group’, kommun ‘municipality’, ledning ‘board’, personal ‘personnel’ etc.
As we noted in chapter 6, these nouns have high proportions of subject occurrences but are in Talbanken05 annotated as non-person referring. It is clear, however, that the nouns classified by this node occur more seldom as predicational modifiers, a construction which semantically requires more concrete and individuated arguments.

In the decision trees, we observe a general tendency for syntactic features to appear higher in the tree, with morphological features occurring closer to the leaf nodes. In particular, the aforementioned SUBJ feature, as well as the features DET, UK, CC, PREP, APP and PRD are recurring features with high coverage in all the decision trees. The morphological feature representing singular number SIN occurs in all decision trees, although with less general coverage. This indicates that the syntactic features provide more general indications of animacy status, but that the morphological features provide the more fine-grained information which ultimately determines the class. Thus both types of features are needed, a result which the experiments clearly showed.

Backward feature selection:
• Generate a pool of features PF = {Fi}
• Initialize the set of removed features RF with the empty set, RF = {}
• loop: for each feature Fi ∈ PF
  – run a cross-validation on the training set without the features in RF and Fi
  – if improvement of accuracy: add Fi to RF
• goto loop until no more improvement

Figure 10: Algorithm for automatic feature selection with backward search.

7.3.4.3 Automatic feature selection

The general feature space was constructed by including features for all annotation relevant to nouns. The above experiments showed that extending the feature space proved beneficial to classification for all data sets. In order to prune the feature space for unnecessary features, we performed backward feature selection from the general set of features.
Backward feature selection starts out with the whole feature set and successively removes features, testing for improvement of results at each step. The algorithm for automatic feature selection employing backward search is presented in figure 10 and has been adapted from the forward algorithm presented in Mihalcea 2002. We remove a feature only if the improvement is statistically significant. We perform automatic feature selection on the >0 data set and find that the accuracy of the classifier improves slightly, from 93.9 to 94.0, but significantly (p<.05), following feature selection. The small difference is caused by an improvement in the classification of the animate class, in particular in terms of precision which improves from 75.7 to 77.1.

The advantage of backward selection is that it also gives us information regarding the importance of each individual feature along the lines of the “all minus one” testing in section 6.3.2. Important features will cause a deterioration of results when removed. We find that the only features which cause statistically significant deterioration of results on removal are the syntactic features SUBJ (p<.01), OBJ (p<.05) and DET (p<.001). As we saw in table 7.7 above, there was a clear distributional difference between the animate and inanimate classes with respect to the syntactic relation of determiner. It turns out that animate determiners are predominantly genitives, so these three features in fact embody little more than the subset of robust features established following chapter 6.

>10 nouns
   (a)    (b)   ← classified as
   222    125   (a) class animate
    49   3390   (b) class inanimate

Table 7.20: Confusion matrix for the MBLopt classifier with a general feature space on the >10 data set on Talbanken05 nouns.

The removal of the GENHD feature is the only case in which we find a significant improvement of results, upon which this feature is permanently removed from the feature pool.
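The backward search in figure 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup used in the experiments: the `evaluate` function stands in for a cross-validation run and is supplied by the caller, a feature is removed on any improvement of the score rather than on a statistically significant one, and the toy scoring function and its values are invented for the example.

```python
def backward_selection(features, evaluate):
    """Remove features one at a time as long as removal improves the score.

    evaluate(feature_subset) is assumed to return a cross-validated score.
    Returns the set of removed features and the best score found.
    """
    all_features = frozenset(features)
    removed = set()
    best = evaluate(all_features)
    improved = True
    while improved:
        improved = False
        for feat in features:
            if feat in removed:
                continue
            score = evaluate(all_features - removed - {feat})
            if score > best:
                removed.add(feat)
                best = score
                improved = True
    return removed, best

# Hypothetical scoring: three features help, one hurts, one makes no difference
USEFUL = {"subj", "obj", "det"}

def toy_evaluate(subset):
    return 900 + 10 * len(USEFUL & subset) - 5 * ("genhd" in subset)

removed, best = backward_selection(["subj", "obj", "det", "genhd", "sin"], toy_evaluate)
print(removed, best)
```

Scores obtained when a single feature is left out also indicate the importance of that feature, in the manner of the “all minus one” testing mentioned above.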
Moreover, we find that all of the morphological features cause small, but insignificant deteriorations of results, as do the syntactic features PREP, ROOT, APP and PRD.

7.3.5 Error analysis

In chapter 6, we examined a small set of nouns in more detail and the current chapter has largely dealt with quantitative analysis of performance results on the scaled up data sets. We found that the morphosyntactic features supported a more fine-grained notion of animacy and explored a three-way classification task. It might be interesting to examine the output from the scaled up classifier in a bit more detail, and, in particular, we may examine the errors. The error analysis examines the performance of the MBLopt-classifier employing all features on the >10 data set in order to abstract away from the most serious effects of data sparseness. Table 7.20 shows a confusion matrix for the classification of the nouns.

Recall from section 7.1.2 above that the person reference annotation of the Talbanken05 nouns distinguishes only the classes corresponding to ‘human’ and ‘inanimate’ along the animacy dimension. There is no intermediate notion of animacy or expression of gradience. An interesting question is whether this choice affects the results. If so, we would expect erroneously classified inanimate nouns to contain nouns of intermediate animacy, such as animals and organizations.

If we examine the errors for the inanimate class we indeed find evidence of gradience within this category. The errors contain a group of nouns referring to animals and other living beings (bacteria, algae), as listed in (126), as well as one noun referring to an “intelligent machine”, included in the intermediate animacy category (Zaenen et al. 2004). Collective nouns with human reference and organizations are also found among the errors, listed in (128).
Both of these are more frequent among the errors (ANIM:18.4%; ORG:12.2%) than in the corpus sample studied in section 7.1 above (ANIM:1.8%; ORG:3.4%). We also find some nouns among the errors with human denotation, listed in (129). These are nouns which typically occur in dereferencing contexts, such as titles, e.g. herr ‘mister’, biskop ‘bishop’ and which were annotated as non-person referring by the human annotators.137 Finally, a group of abstract, human-denoting nouns are also found among the errors, as listed in (130). In summary, we find that nouns with gradient animacy properties account for 53.1% of the errors for the inanimate class.

(126) Animals/living beings: alg ‘algae’, apa ‘monkey’, bakterie ‘bacteria’, björn ‘bear’, djur ‘animal’, fågel ‘bird’, fladdermöss ‘bat’, myra ‘ant’, mås ‘seagull’, parasit ‘parasite’

(127) Intelligent machines: robot ‘robot’

(128) Collective nouns, organizations: myndighet ‘authority’, nation ‘nation’, företagsledning ‘corporate-board’, personal ‘personnel’, stiftelse ‘foundation’, idrottsklubb ‘sport-club’

(129) Human-denoting nouns: biskop ‘bishop’, herr ‘mister’, nationalist ‘nationalist’, tolk ‘interpreter’

(130) Abstract, human nouns: förlorare ‘loser’, huvudpart ‘main-party’, konkurrent ‘competitor’, majoritet ‘majority’, värd ‘host’

For the animate nouns which are misclassified we have, as noted above, the additional influence of distributional factors and data sparseness. It is therefore more difficult to find any clear patterns in the misclassified nouns. It is interesting to note, however, that there are several nouns which recur as errors in the experiments for both Norwegian and Swedish. Among the highly frequent, misclassified animate nouns, we find the instances barn ‘child’ and vän ‘friend’, which recurred in the error analyses for the experiments in chapter 6.
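The per-class scores for the >10 data set can be recomputed from the confusion matrix in table 7.20, where rows are the annotated classes and columns the assigned classes. The following sketch uses the four counts from that table; the function and variable names are chosen for illustration.

```python
def class_scores(tp, fn, fp):
    """Precision, recall and F-score for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts from table 7.20 (rows: annotated class, columns: assigned class)
anim_as_anim, anim_as_inan = 222, 125
inan_as_anim, inan_as_inan = 49, 3390

anim = class_scores(tp=anim_as_anim, fn=anim_as_inan, fp=inan_as_anim)
inan = class_scores(tp=inan_as_inan, fn=inan_as_anim, fp=anim_as_inan)
total = anim_as_anim + anim_as_inan + inan_as_anim + inan_as_inan
accuracy = (anim_as_anim + inan_as_inan) / total

print([round(100 * x, 1) for x in anim])   # animate precision, recall, F-score
print([round(100 * x, 1) for x in inan])   # inanimate precision, recall, F-score
print(round(100 * accuracy, 1))
```

The rounded values agree with the >10 row of table 7.18 and with the accuracy reported for MBLopt on the >10 data set in table 7.16. The 125 animate nouns assigned to the inanimate class explain why recall, rather than precision, is the weak point for the animate class.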
137 In fact, both of these showed variable annotation in the treebank and were assigned their majority class – inanimate – in the extraction of training data.

7.3.5.1 Human versus automatic annotation

In chapter 6, we investigated gradience of the animacy dimension as evidenced by distributional data for Norwegian. We examined the case of organization nouns in more detail and experimented with supervised and unsupervised learning of a more fine-grained animacy distinction. The annotation study performed initially in this chapter applied an even more fine-grained annotation scheme for animacy to a Swedish corpus sample. We concluded that intermediate categories are infrequent and settled for a binary distinction in the ensuing classification experiments. Even so, the often noted gradience is evident both in human and automatic annotation.

The manual annotation for person reference in the Talbanken05 treebank showed inconsistencies for certain instances. We found that these were a result of difficulties in ascertaining denotation and/or reference for the noun in question. For instance, nouns with bleak denotational properties, such as element ‘element’, part ‘party’, were assigned varying annotation by the human annotators. We also found examples like soldat ‘soldier’ and student ‘student’, where denotational properties with respect to animacy are clear, but where dereferencing properties of the context caused annotation inconsistency. The main distinction between the classes of human and inanimate was fairly straightforward to apply, however, and the annotation was consistent with respect to intermediate categories such as animals and organizations. The human annotators clearly did not have difficulties in assigning animals to the non-person category, as instructed. This is not surprising since these constitute a clearly defined category, separate from persons.
The experimental results show clear evidence of gradience. In the feature analysis, for instance, we noted that a group of organization nouns were classified under a separate node in the decision tree, which tested for properties compatible with both classes. The error analysis of this section has shown that the inanimate class does not easily incorporate animals in terms of linguistic distribution. It is interesting to note that both the human and automatic annotation showed difficulties in ascertaining class for a group of abstract, human-denoting nouns, like individ ‘individual’, motståndare ‘opponent’, kandidat ‘candidate’, representant ‘representative’. These were all assigned to the animate majority class during extraction, but were misclassified as inanimate during classification.

Comparing human and automatic annotation we find that these elucidate different properties of the animacy dimension. If we contrast the type of gradience found in the human and the automatic annotation, we may note some differences. The automatic classification deals purely with animacy as a linguistic category; i.e. animacy as evidenced in linguistic use. It also by definition treats animacy as a denotational category since the data representation abstracts over individual contexts of usage. Human annotators clearly have available world knowledge, and in particular also the animacy categories as ontological categories (Dahl 2008). As we saw above, animals constitute a fairly clearly delimited ontological category and are not confused with the category of humans, regardless of their linguistic behaviour. Furthermore, the task of human annotation differs from the automatic classification task in that the annotation is token-level and specific linguistic contexts clearly influence the annotation.

Human and automatic annotation also show a great deal of overlap in the treatment of animacy.
The fact that we may, through machine learning based on distributional data from language use, replicate the annotation fairly successfully shows that animacy is largely a denotational property of nouns and we find that the animacy of a noun influences its linguistic distribution consistently and over large amounts of data.

7.4 Summary of main results

This chapter has discussed the scaling up of animacy classification. In the following, we address the questions posed initially in this chapter.

A prerequisite for supervised learning of animacy information is an annotated set of instances. We investigated referential and denotational approaches to animacy annotation through an annotation study performed by the author, as well as a corpus study of the annotation for person reference found in the Swedish treebank Talbanken05. These highlight problematic constructions for both types of approaches. In particular, we find that dereferencing constructions are problematic for referential approaches, whereas elements with vague or abstract denotational properties are problematic under a denotational approach. We conclude that a denotational approach is to be preferred for lexical acquisition of animacy information based on distributional evidence and that the material in Talbanken05 largely follows a denotational practice, hence, is well suited as training data. We also conclude that the dimension of person reference largely overlaps with animacy, and may be employed to approximate animacy.

In chapter 6 we developed a set of motivated features for animacy classification and we showed that a subset of these proved to be reasonably robust to data sparseness. A question posed initially concerns whether a transfer of the method to Swedish and a different data set is viable. The experimental results in section 7.3.2 indicate that this is indeed the case.
By abstracting away from the skewed distribution of the data, as well as data sparseness, we showed that the robust features SUBJ, OBJ and GEN provide comparable results to those obtained for Norwegian. The features proved to differentiate between the two classes well, resulting in balanced class results, with accuracy and F-scores around 90%.

Two main obstacles have been identified in the scaling of the animacy classification task: data sparseness and a skewed class distribution. As noted earlier in chapter 6, data sparseness is bound to be a problem for any method relying on distributional data, so also in the current chapter. We may conclude that these two factors are independent, but clearly also interact. We found that a skewed class distribution causes unbalanced class results for non-sparse data (in the >1000 experiments), and we found that data sparseness had detrimental effects on performance for non-skewed data sets, in the experiments with uniform class distribution. An advantage under the present setting is that we have available a notably larger data set. A key question therefore concerns how properties of the data representation, as well as learner properties should be defined in order to fully capture the information contained in the data and thereby alleviate some of the problems caused by the density and distributional properties of the data set.

A general feature space was constructed which took into account both morphological and syntactic evidence. Feature importance in classification was analyzed both experimentally, through classification with feature subsets and automatic feature selection, as well as manually, through the manual inspection of decision trees. Whereas the syntactic features were clearly most important, the morphological features provided useful clues, resulting in a combined effect in terms of performance.
We obtain results for animacy classification, ranging from 97.3% accuracy to 94.0% depending on the sparsity of the data. With an absolute frequency threshold of 10, we obtain an accuracy of 95.4%, which constitutes a 50% reduction of error rate. With respect to class, we find that classification of the inanimate class is quite stable throughout the experiments, whereas the classification of the minority class of animate nouns suffers from sparse data. It is an important point, however, that it is largely recall for the animate class which goes down with increased sparseness, whereas precision remains quite stable. All of these properties are clearly advantageous in the application to realistic data sets, where a more conservative classifier is to be preferred.

An initial comparison between eager and lazy machine learning algorithms highlighted the need for a more discriminating notion of similarity in vector space for the memory-based learner. A parameter optimization stage was therefore introduced, which gave significant improvements in combination with a general feature space. With optimized lazy learners, we found no striking differences between the two learning algorithms. In general, it seems that the property of data generalization prior to classification is an advantage of the eager learner, given enough data to generalize over. Another advantage is the possibility for manual inspection of the decision trees, a feature which may be exploited in feature and error analysis. Even so, the optimized, memory-based learner in general performed slightly better than the decision trees, however, with few differences being significant. We may therefore conclude that both types of learning algorithms are well suited for animacy classification.
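The reduction of error rate cited above relates the accuracy of a classifier to that of the baseline. As a small worked example, the sketch below recomputes the 50% figure from the >10 row of table 7.16 and the 80.6%-37.2% range reported for the uniform data sets from the MBL column of table 7.13; the function name is an assumption made for illustration.

```python
def error_reduction(baseline_acc, acc):
    """Relative reduction of the error rate, with accuracies given in percent."""
    baseline_err = 100.0 - baseline_acc
    return 100.0 * (acc - baseline_acc) / baseline_err

# >10 data set, general feature space: baseline 90.8, MBLopt 95.4 (table 7.16)
print(round(error_reduction(90.8, 95.4), 1))

# Uniform data sets, MBL with original features (table 7.13)
print(round(error_reduction(50.0, 90.3), 1))  # >1000Uni
print(round(error_reduction(50.0, 68.6), 1))  # >0Uni
```

Because the baseline for the skewed data sets is already above 90%, a gain of a few points in accuracy corresponds to a large share of the remaining errors.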
Both in the preceding and current chapters we have expressed the underlying goal of elucidating properties of the animacy dimension and, in particular, the way in which it influences the linguistic distribution of nouns. This has been accomplished through corpus studies and experimental studies where a focus has been on feature and error analysis. One question with respect to animacy has been its gradience, i.e. whether the animacy dimension is a strictly binary one – animate and inanimate – or whether there are elements which have properties of both polarities. This question was addressed above with data both from human annotation and experimental results. Under the assumption of a binary animacy opposition, we showed how annotation inconsistency as well as classification errors provide different perspectives on the gradience of animacy. The fact that the human annotation classified tokens highlighted the influence of the linguistic context on classification, and the problems identified there were largely caused by elements which were denotationally or referentially variable. In the case of the automatic classification, a different picture emerges since the task in this case is to abstract over the totality of contexts for a particular noun. With no world-knowledge available, the automatic annotation deals strictly with animacy as a linguistic category. In the error analysis, we found that this approach causes the proposed, intermediate categories along the animacy dimension to emerge.

Part III
Parsing

8 ARGUMENT DISAMBIGUATION IN DATA-DRIVEN DEPENDENCY PARSING

Arguments tend to differ along a range of linguistic dimensions. We have examined one of these dimensions, namely animacy, in detail in part II and we found that statistical tendencies in syntactic realization proved to be reliable indicators of animacy.
We have discussed the notion of soft, probabilistic constraints and presented evidence from a range of languages, and across various linguistic subdisciplines, suggesting that argument differentiation is influenced by these types of constraints. In syntactic parsing, argument status is assigned automatically employing various types of information, such as part-of-speech, lexical form etc. Parsing is thus a practical task where argument differentiation is put actively to use and it provides us with a setting where we may study the influence of various types of information in a set of controlled experiments.

This chapter introduces data-driven dependency parsing and motivates its choice in the current context. We present a methodology for error analysis of parse results and apply this methodology to the results for Swedish from MaltParser, a data-driven dependency parser. The error analysis sets the scene for the experiments presented in chapter 9, where we investigate the influence of a range of different linguistic properties on argument disambiguation.

8.1 Syntactic parsing

Whereas parsing in the general sense provides an interesting task and a controlled testing ground for argument differentiation, the types of generalizations which may be arrived at are clearly influenced by properties of the parser. For instance, in a grammar-driven parser, the grammar strictly defines the set of possible output strings and the grammar formalism chosen will also influence the analysis and possibly also the general expressivity.

The parser to be employed in these experiments should be compatible with the general assumptions established in chapters 2 and 3. We want our parser to deal with natural language and to be robust in assigning an analysis to any input string. Implicit in such a choice is a view of grammaticality as not being strictly defined by a grammar.
We have discussed the central role of frequency, both from the point of view of linguistic constraints in argument differentiation and in terms of the modeling of these constraints as soft in the probabilistic sense. In a data-driven parser, parsing is by definition guided by frequencies in language use. This allows us to make as few assumptions as possible with respect to formulations of constraints on arguments, as well as their interaction, in terms of a grammar. Even so, it is clear that the nature of the data on the basis of which parsing is approximated directly determines the analyses constructed.

With respect to arguments and argument differentiation, we have tried to make as few theoretical assumptions as possible. In particular, we do not want to commit to a structural definition of argument status. Rather, a view of grammatical functions as primitive notions, hence separated from surface linguistic properties such as linear precedence and morphological realization, enables investigations into mismatches between levels of linguistic analysis. In dependency analysis functional argument structure is separated from structural positioning and formulated as dependency relations. Structural assumptions are furthermore stripped down to the minimal relation between a head and its dependent, highlighting the link to semantic interpretation.138

138 It is interesting to note that a recent line of investigation in syntactic parsing with phrase-structural representations has focused on the task of function-labeling (Blaheta and Charniak 2000; Merlo and Musillo 2005), where syntactic function labels are assigned to enrich the phrase-structure trees either in a separate post-processing stage or as an integral part of parsing. The more direct link to semantic analysis is cited as a main motivation for this task (Merlo and Musillo 2005). Although English has been the main language under study thus far, work on function labeling for Spanish highlights the particular importance of this type of information in dealing with languages that are less configurational than English (Chrupała and van Genabith 2006).

8.1.1 Data-driven parsing

A distinction is often made between grammar-driven and data-driven parsing, where the former is characterized by a generative grammar which defines the language under analysis and the latter is not (Carroll 2000). This distinction has, however, become less clear-cut due to the extensive use of empirical methods in the field in recent years. Most current parsers are data-driven in the sense that they employ frequencies from language data to induce information to improve parsing. Data-driven parsing may thus be characterized, first and foremost, by the use of inductive inference, rather than by the use of, or dispensing with, a grammar in the traditional sense (Nivre 2006). The development is also a temporal one, where the early parsers, consisting solely of hand-crafted grammars expressed as rules, observably ran into serious difficulties resulting from ambiguity in natural language, see section 2.3.1.

As a step towards disambiguation, statistical models have been widely employed in parsing. These models assign probabilities to syntactic structure by decomposing syntactic representations and defining probability distributions over these. Statistical models may thus be employed for parse selection following purely grammar-driven approaches, since the models assign probability scores to the analyses returned by the grammar. Grammars may also be extended stochastically to produce probabilistic versions. The probabilistic extensions of context-free grammars (PCFGs) (Charniak 1996), for instance, define probability distributions over non-terminal nodes, where the probability of a syntactic analysis is simply the product of the probabilities of all its subtrees.
Most statistical parsing models can be viewed as history-based: they decompose the parse tree into a set of parse decisions, each associated with a certain probability. The particular decomposition chosen is an important component in defining statistical parsing models (Collins 1999). PCFGs display some well-known weaknesses resulting from precisely the independence assumptions made in the statistical model, where the application of a phrase-structural rule depends only on the local subtree to which it applies, disregarding the larger structural context, as well as any lexical dependencies which may hold between elements lower in the tree. Subsequent work has focused on lexicalization of PCFGs (Collins 1996; Charniak 1997, 2000), as well as alternative decompositions to increase context-sensitivity (Collins 1999; Johnson 1998; Klein and Manning 2003).

The availability of treebanks has been crucial to the development of data-driven parsing, supplying data for inductive inference in terms of estimation of parameters for statistical parse models or even for the induction of whole grammars, so-called treebank grammars (Charniak 1996). A system for data-driven parsing of a language L may be defined by three components (Nivre 2006: 27):

1. A formal model M defining permissible analyses for sentences in L.
2. A sample of text T_t = (x_1, ..., x_n) from L, with or without the correct analyses A_t = (y_1, ..., y_n).
3. An inductive inference scheme I defining actual analyses for the sentences of any text T = (x_1, ..., x_n) in L, relative to M and T_t (and possibly A_t).

As we have seen above, the formal model M may consist of a hand-crafted or induced grammar. The inductive inference scheme may consist simply in maximum likelihood estimation from a corpus, as in the case of PCFGs discussed above.

In strictly data-driven approaches, a grammar, whether hand-crafted or induced, does not figure at all.
Hence, the formal model M is not a grammar and the sample of text T_t is a treebank containing the correct analyses with respect to M, which constitutes the training data for the inductive inference scheme I. Parsing in this respect does not rely on a definition of the language under analysis independently of the input data. Without a formal grammar, data-driven models condition on a rich context in the search for the most probable analysis, and hence are clearly history-based. Magerman (1995) describes an early, purely data-driven parser for English which decomposes phrase-structural trees into a set of features (lexical form, part-of-speech tag, structural position etc.) and employs decision trees to score individual decisions during parsing. Collins (1999) shows that decomposition in terms of head-modifier dependencies produces a significantly more accurate parser.

The fact that parsing is unconstrained by a grammar gives a very large search space and there are various strategies for making search tractable. Typically, some sort of pruning of the search space is necessary to prevent computation of the probability of all possible parses. Deterministic processing constitutes another, very efficient strategy where the probability of each decision is maximized at each choice point during the derivation.

8.1.2 Dependency parsing

The use of dependency representations, see section 5.1.1, in syntactic parsing has recently received extensive attention in the NLP community (Buchholz and Marsi 2006; Nivre et al. 2007). One of the arguments in favour of parsing with dependency representations is that dependency relations are much closer to the semantic relations which hold between words in a sentence. As automatic parsing often is viewed as a means to a semantic interpretation of a sentence, dependency analysis represents a step in the right direction.
We may define the task of dependency parsing informally as the mapping from natural language sentences to well-formed dependency structures. As before, this mapping may be defined explicitly and exhaustively by a grammar, and it may be data-driven to various extents, as discussed above.

8.1.3 Data-driven dependency parsing

In data-driven dependency parsing, the formal model M defining permissible analyses is given by a definition of a dependency graph – a labeled acyclic graph with certain properties, see for instance section 5.1.1 above and references therein. Decomposition of the dependency graph for induction is one point where approaches to data-driven dependency parsing differ and we may distinguish between transition-based approaches, where dependency graphs are decomposed into parse transitions (see, e.g., Yamada and Matsumoto 2003; Nivre, Hall and Nilsson 2004) and graph-based approaches, where dependency graphs are decomposed into subgraphs or individual dependency arcs (see, e.g., McDonald et al. 2005).

MaltParser is a language-independent system for data-driven dependency parsing, which is based on a deterministic parsing strategy (Nivre 2003; Nivre, Hall and Nilsson 2004), in combination with treebank-induced classifiers for predicting parse transitions. It allows for explicit formulation of features employed during parsing by means of a feature model and is optimal with respect to incrementality.

8.1.3.1 Parsing strategy

The parsing strategy consists in a non-deterministic parsing algorithm which is made deterministic by a parse guide. The parsing algorithm is an adaptation of the shift-reduce algorithm for context-free phrase structure grammars for application to dependency graphs (Nivre 2003).

The parsing algorithm in MaltParser construes parsing as a set of transitions between parse configurations.
A parse configuration is a triple 〈S, I, G〉, where S represents the parse stack – a list of tokens which are candidates for dependency arcs, I is the queue of remaining input tokens, and G represents the dependency graph defined thus far (Nivre 2006). There are four possible transitions between parse configurations (where top is the token on top of the stack, and next is the next token in the input) (Nivre et al. 2006: 1):

SHIFT: Push next onto the stack.
REDUCE: Pop the stack.
RIGHT-ARC(r): Add an arc labeled r from top to next; push next onto the stack.
LEFT-ARC(r): Add an arc labeled r from next to top; pop the stack.

The parsing algorithm described above is clearly non-deterministic in allowing for several possible transitions out of most parse configurations. The choice to make the parsing strategy deterministic is made primarily on grounds of efficiency (Nivre 2006). However, we will see below that it also has the effect that the parser is incremental.

The parse guide predicts the next parse action (transition), based on the current parse configuration. The guide is trained employing discriminative machine learning, which recasts the learning problem as a classification problem: given a parse configuration, predict the next transition. Prediction thus relies on a decomposition of the gold standard data set into parse configurations and a feature model which defines the relevant attributes of the configuration for use by the classifier.139

8.1.3.2 Feature model

As mentioned above, the parse guide predicts the next parse action based on the current parse configuration.140 The feature model defines the relevant attributes of tokens in a parse configuration. There are generally two main types of attributes – static and dynamic. Static attributes are constant and defined by the input to the parser, whereas dynamic attributes are updated during parsing.
Examples of static attributes are lexical form and part-of-speech. The feature model in MaltParser also enables the use of dynamic attributes of the dependency graph under construction, in particular the dependency relation.

Parse configurations are represented by a set of features, which focus on attributes of top, next and neighboring tokens in the stack, input queue and dependency graph under construction. Figure 11 shows an example of a feature model which employs the word form (FORM), part of speech (POS), and dependency relation (DEP) of a given token. The feature model is depicted as a matrix where rows denote tokens in the parser configuration, defined relative to the stack (S), input queue (I) and dependency graph (G), and columns denote attributes. Each cell containing a + corresponds to a feature of the model. Examples of the features include part-of-speech for the top of the stack, lexical form for the next and previous (next−1) input tokens and the dependency relation of the rightmost sibling of the leftmost dependent of top.

139 See Nivre 2006 for details about the derivation of training data.

140 To be precise, classification is performed on the basis of equivalence classes of configurations, where equivalence classes are constructed in terms of parse configurations and the features employed to represent them. The function Φ defines an equivalence relation over properties of configurations and is composed of a set of feature functions which each pick out a certain property of the current configuration (Nivre 2006).
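The transition system and the notion of a feature model described above can be made concrete with a small sketch. It is an illustration only and not MaltParser's implementation: the class and function names, the artificial root token at index 0, the three-word toy sentence and the three features chosen (a small subset of the kind shown in figure 11) are all assumptions made for the example, and no preconditions on the transitions are checked.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    stack: list                              # S: tokens that are candidates for arcs
    queue: list                              # I: remaining input tokens
    arcs: set = field(default_factory=set)   # G: (head, label, dependent) triples


def shift(c):
    """SHIFT: push next onto the stack."""
    c.stack.append(c.queue.pop(0))


def reduce_top(c):
    """REDUCE: pop the stack."""
    c.stack.pop()


def right_arc(c, label):
    """RIGHT-ARC(r): add an arc from top to next; push next onto the stack."""
    nxt = c.queue.pop(0)
    c.arcs.add((c.stack[-1], label, nxt))
    c.stack.append(nxt)


def left_arc(c, label):
    """LEFT-ARC(r): add an arc from next to top; pop the stack."""
    c.arcs.add((c.queue[0], label, c.stack.pop()))


def dep_of(c, token):
    """Dynamic attribute: the label assigned to a token so far, if any."""
    for _head, label, dependent in c.arcs:
        if dependent == token:
            return label
    return None


def features(c, forms, tags):
    """Hypothetical three-feature model: two static and one dynamic attribute."""
    top = c.stack[-1] if c.stack else None
    nxt = c.queue[0] if c.queue else None
    return {
        "POS(S:top)": None if top is None else tags[top],
        "DEP(S:top)": None if top is None else dep_of(c, top),
        "FORM(I:next)": None if nxt is None else forms[nxt],
    }


# Toy input: index 0 is an assumed artificial root token.
forms = ["<root>", "De", "har", "ansvar"]
tags = ["ROOT", "PO", "V", "N"]
config = Configuration(stack=[0], queue=[1, 2, 3])

shift(config)                             # stack [0, 1], queue [2, 3]
left_arc(config, "SS")                    # har -> De; stack [0], queue [2, 3]
right_arc(config, "ROOT")                 # root -> har; stack [0, 2], queue [3]
snapshot = features(config, forms, tags)  # what a guide would see at this point
right_arc(config, "OO")                   # har -> ansvar; stack [0, 2, 3], queue []
```

In an actual parser the sequence of transitions is not given in advance: at each configuration the feature values are handed to the trained classifier, which returns the transition to apply.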
FORM POS DEP
S:top + + +
S:top+1 +
I:next + +
I:next−1 +
I:next+1 + +
I:next+2 +
G: head of top +
G: leftmost dependent of top +
G: rightmost dependent of top +
G: leftmost dependent of next + +
G: leftmost dependent of head of top +
G: leftmost sibling of rightmost dependent of top +
G: rightmost sibling of leftmost dependent of top +
G: rightmost sibling of leftmost dependent of next + +

Figure 11: Feature model for Swedish; S: stack, I: input, G: graph; ±n = n positions to the left (−) or right (+).

8.1.3.3 Training and parsing

Parsing with the MaltParser system involves two phases – a training phase and a parsing phase (Nivre 2006). Training involves the extraction of feature vectors from the gold standard data set, and the induction of a parse guide. Parsing proceeds by extraction of feature vectors for every non-deterministic configuration and querying of the parse guide. Classifiers can be trained using any machine learning approach, but the best results have so far been obtained with support vector machines, using LIBSVM (Chang and Lin 2001) with a quadratic kernel $K(x_i, x_j) = (\gamma x_i^T x_j + r)^2$, see (Nivre et al. 2006) for more detail.

8.1.3.4 Incrementality

Strict incrementality in parsing involves connectedness at each point during the analysis of the input string. Unlike many other data-driven parsers, MaltParser approaches incrementality.
Nivre (2004) shows that while incrementality in the strict sense is not attainable in dependency parsing, the arc-eager parsing algorithm employed in MaltParser is optimal in that it provides a close approximation of incrementality.141

8.2 Error analysis

In grammar-driven systems, errors may be directly related to properties of the grammar and error analysis can exploit this more transparent relation in diagnosing and eliminating errors.142 In many data-driven systems, however, there is no explicit grammar responsible for errors, hence error analysis consists solely in analysis of the relation between the input and output data. A deeper analysis of specific error sources in data-driven parsing is clearly an important step towards a further advancement of the state of the art.

The rest of this chapter will be devoted to an in-depth error analysis of argument assignment in a data-driven dependency parser – MaltParser, as trained on the Swedish treebank Talbanken05, see section 5.1.1. We start out by formulating a methodology for error analysis which allows us to quantify parser performance with respect to specific dependency relations and specific error types over these relations. The following error analysis will focus on errors in the assignment of argument relations and relate these errors to morphosyntactic properties of the arguments, the set of which is bounded by the features employed during parsing and specified explicitly in the feature model. We will attempt to provide answers to the following questions:

Error analysis: How may we characterize the errors made by the parser?
Argument errors: What characterizes the errors in argument assignment? May the errors be related in any consistent way to variation in the linguistic expression of arguments?
Generalizations: Which types of generalizations may be acquired regarding syntactic arguments in a strictly data-driven setting?
Argument disambiguation: To what extent may syntactic arguments be distinguished based solely on surface properties like lexical form, morphology and word order?

141 The arc-eager algorithm differs from the standard shift-reduce algorithm for dependency structures in treating right and left dependencies differently, hence coping with chains of right dependencies which require stacking of waiting heads/dependents. Incrementality is then redefined as a restriction on connectedness between the stack and the graph under construction.

142 We may distinguish between error analysis and error mining, where error analysis has a focus on characterizing (and possibly correcting) errors and makes use of a gold standard to locate the errors, usually in the form of a treebank or a test suite. Error mining, on the other hand, focuses primarily on locating errors, see for instance van Noord 2004.

8.2.1 A methodology for error analysis

An error analysis of parse results may typically be characterized by the following two steps:

1. locate the errors
2. characterize the errors

Step 1 involves comparison of the parser output with a gold standard corpus and the application of an evaluation metric. Step 2 is less straightforward and more dependent on the aim of the analysis. We will treat errors as sets of nodes in a dependency structure and characterize these by lexical and structural properties, such as part-of-speech and dependency label.
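The two steps can be sketched in a few lines of code. The sketch is a minimal illustration under stated assumptions: the Token record and the function names are hypothetical, and the heads and labels in the toy sentence are invented for the example rather than taken from the treebank or from actual parser output.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    gold_head: int
    gold_label: str
    sys_head: int
    sys_label: str


def locate_errors(tokens):
    """Step 1: tokens whose head or label differs from the gold standard."""
    return [t for t in tokens
            if (t.sys_head, t.sys_label) != (t.gold_head, t.gold_label)]


def characterize(errors, key):
    """Step 2: count the errors by some property of the erroneous token."""
    return Counter(key(t) for t in errors)


# Invented annotation for illustration; heads are 1-based, 0 is the root.
tokens = [
    Token("Genom", "PR", 3, "AA", 3, "AA"),
    Token("deGaulle", "N", 1, "PA", 1, "PA"),
    Token("bröts", "V", 0, "ROOT", 0, "ROOT"),
    Token("länken", "N", 3, "SS", 3, "OO"),   # label error only
    Token("med", "PR", 4, "ET", 3, "OA"),     # head and label error
    Token("NATO", "N", 5, "PA", 5, "PA"),
]

errors = locate_errors(tokens)
by_pos = characterize(errors, lambda t: t.pos)
by_label = characterize(errors, lambda t: t.gold_label)
```

Passing a different key function to characterize is all that is needed to regroup the same error set by another property, which is the sense in which step 2 depends on the aim of the analysis.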
8.2.1.1 Evaluation metric

In the evaluation of dependency parsing, overall parsing accuracy is commonly reported using the standard metrics of unlabeled attachment score (UAS) and labeled attachment score (LAS), i.e., the percentage of nodes that are assigned the correct head without (unlabeled) or with (labeled) the correct dependency label:

\[
\mathrm{UAS} = \frac{\#\ \text{correctly attached tokens}}{\#\ \text{tokens}}
\qquad
\mathrm{LAS} = \frac{\#\ \text{correctly attached and labeled tokens}}{\#\ \text{tokens}}
\]

For analysis of performance in the assignment of specific dependency labels, we employ the standard measures of precision and recall, as well as the combined, balanced F-score. Note that only correctly labeled and attached tokens are considered as true positives.

8.2.1.2 Error sets and types

The dependency relations in the treebank data adhere to the single-head constraint, see 5.1.1, hence we may equate errors with token nodes in a dependency graph with corresponding properties, such as head, dependency relation etc.143 In general, we will treat error analysis as dealing with sets of errors, i.e. token nodes, and comparisons between parsers as standard set-theoretic operations over these sets. The sets of errors will vary depending on the evaluation metric, but will always constitute the complement set of the set of correct instances according to the chosen metric.

143 As Nivre (2006) notes, the single-head constraint allows for the assignment of dependency relations to nodes, rather than arcs, which simplifies the formulation of the labeling performed during parsing. In the same way it allows for error analysis dealing with dependent nodes.
For the UAS and LAS evaluation metrics defined above, we define the sets of correct instances, correct_U and correct_L, and the sets of errors, error_U and error_L, as follows:

\[
\begin{aligned}
\mathit{correct}_U &= \{x \mid \text{head of } x \text{ is correct}\}\\
\mathit{error}_U &= \{x \mid \text{head of } x \text{ is wrong}\}\\
\mathit{correct}_L &= \{x \mid \text{head of } x \text{ is correct and dependency label of } x \text{ is correct}\}\\
\mathit{error}_L &= \{x \mid \text{head of } x \text{ is wrong and dependency label of } x \text{ is wrong}\}\\
&\quad\cup\ \{x \mid \text{head of } x \text{ is wrong and dependency label of } x \text{ is correct}\}\\
&\quad\cup\ \{x \mid \text{head of } x \text{ is correct and dependency label of } x \text{ is wrong}\}
\end{aligned}
\]

In comparing sets of errors P_A and P_B for two parsers, we may examine properties of their intersection, P_A ∩ P_B, and difference, P_A − P_B, where:

\[
\begin{aligned}
P_A \cap P_B &= \{x \mid x \in P_A \text{ and } x \in P_B\}\\
P_A - P_B &= \{x \mid x \in P_A \text{ and } x \notin P_B\}, \text{ and } P_A - P_B = P_A - (P_A \cap P_B)
\end{aligned}
\]

When parser A is the gold standard data set, we obviously have that P_A = {}. In a comparison between two parsers, we wish to locate the differences responsible for a general improvement or deterioration in overall results. We may thus define improvement and deterioration of results for the error set of a new parser P_N, compared to that of a baseline parser P_Bl, as follows:

improvement if |P_Bl| > |P_N|. We may then examine properties of the corrected errors in the difference set P_Bl − P_N further.
deterioration if |P_Bl| < |P_N|. We may then examine properties of the new errors in the difference set P_N − P_Bl further.

In characterizing the results we may sort the sets of errors defined above into error types, based on various, relevant properties. With our main objective being an analysis of argument disambiguation, the main focus will be on analysis of labeled results, with a focus on the argument dependency relations.
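The metrics and error sets defined above translate directly into set operations over token identifiers. The following is a minimal sketch under stated assumptions: analyses are represented as hypothetical dictionaries from token id to a (head, label) pair, the function names are invented for the illustration, and the three small analyses are toy data rather than parser output.

```python
def error_set(gold, parsed, labeled):
    """error_L (labeled=True) or error_U (labeled=False) as a set of token ids."""
    if labeled:
        return {t for t in gold if parsed[t] != gold[t]}
    return {t for t in gold if parsed[t][0] != gold[t][0]}


def attachment_score(gold, parsed, labeled):
    """LAS (labeled=True) or UAS (labeled=False) as a fraction of all tokens."""
    return (len(gold) - len(error_set(gold, parsed, labeled))) / len(gold)


def label_counts(gold, parsed, label):
    """Counts for one dependency label: (correct, gold, system)."""
    n_gold = sum(1 for _head, lab in gold.values() if lab == label)
    n_system = sum(1 for _head, lab in parsed.values() if lab == label)
    # Only tokens that are both correctly attached and labeled are true positives.
    n_correct = sum(1 for t in gold if gold[t][1] == label and parsed[t] == gold[t])
    return n_correct, n_gold, n_system


def prf(correct, gold, system):
    """Precision, recall and balanced F-score from the three counts."""
    precision = correct / system if system else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)


# Toy analyses: token id -> (head, label); head 0 is the root.
gold = {1: (2, "SS"), 2: (0, "ROOT"), 3: (2, "OO"), 4: (3, "ET"), 5: (4, "PA")}
baseline = {1: (2, "OO"), 2: (0, "ROOT"), 3: (2, "OO"), 4: (2, "OA"), 5: (4, "PA")}
new = {1: (2, "SS"), 2: (0, "ROOT"), 3: (2, "OO"), 4: (2, "ET"), 5: (4, "PA")}

p_bl = error_set(gold, baseline, labeled=True)   # P_Bl
p_n = error_set(gold, new, labeled=True)         # P_N

improvement = len(p_bl) > len(p_n)
corrected = p_bl - p_n     # errors fixed by the new parser
remaining = p_bl & p_n     # errors shared by both parsers
introduced = p_n - p_bl    # new errors made only by the new parser
```

Applied to the counts reported for SS in table 8.2 (Correct 17444, Gold 19383, System 19274), prf gives the recall, precision and F-score of 90.00, 90.51 and 90.25 listed there.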
For the analysis of the labeled results we create error types based on dependency relations:

• the gold dependency relation in the error (Depgold)
• the gold and system assigned dependency relation in the error (Depgold_Depsys).144

For analysis of unlabeled results, the error types will be given by the part-of-speech of the dependent and/or head:

• part-of-speech of dependent in the error (POSdep)
• part-of-speech of dependent and gold head in the error (POSdep_POShead)
• part-of-speech of dependent and gold head and erroneous system head in the error (POSdep_POShead_POSerr)

8.2.2 Data

The data for the error analysis of argument assignment in Swedish was obtained by parsing the written part of Talbanken05 with MaltParser.145 We employed the settings optimized for Swedish in the CoNLL-X shared task (Nivre et al. 2006), with the feature model presented in figure 11. As we can see, the features employed during parsing are part-of-speech (POS), lexical form (FORM) and structural properties of the dependency graph under construction (DEP). We employed 10-fold cross validation for training and testing, and the overall result for unlabeled and labeled dependency accuracy is 89.87 and 84.92, respectively.146

8.2.3 General overview of errors

Table 8.1 presents a list of the overall most frequent error types (Depgold_Depsys) in the data, sorted by absolute frequency. The most frequent error type, ET_OA, exemplified by (131), consists largely of prepositional, post-nominal modifiers which have been analyzed as object adverbials.147 In other words, these errors, as well as the converse OA_ET errors in (132), are prepositional attachment errors.

144 Note that it might be the case that Depgold = Depsys when head attachment alone is the source of error.

145 All examples in the current chapter and chapter 9 are taken from the written sections of Talbanken05, unless otherwise stated.
146 Note that these results are slightly better than the official CoNLL-X shared task scores (89.50/84.58), which were obtained using a single training-test split, not cross-validation. Note also that, in both cases, the parser input contained gold standard part-of-speech tags.

147 Object adverbials (OA) are adverbials which are closely related to the verb, much like objects, without necessarily being subcategorized for by the verb. They are predominantly (90.3%) headed by a preposition and the choice of preposition is governed by the verb (Teleman 1974).

(131) Genom    deGaulle  bröts       länken    med   NATO
      through  deGaulle  broke-PASS  link-DEF  with  NATO
      ‘Through deGaulle the ties with NATO were severed’

(132) På  fredagen    disputerar  Åke  Nilsson  på  avhandlingen
      on  friday-DEF  defends     Åke  Nilsson  on  thesis-DEF
      ‘This Friday, Åke Nilsson will defend the thesis’

We also find a range of other adverbial relations among the errors presented in table 8.1. Recall from section 5.1.1 that the annotation in Talbanken05 makes numerous, fine-grained distinctions in adverbial functions (spatial, temporal, modal, comparative etc.). These clearly prove difficult for the parser to replicate. Among the most frequent errors, we also find a large group involving the core argument relations of subjects – regular and formal subjects – and direct object. In particular, confusions of the two argument functions of subject and direct object (SS_OO, OO_SS) are among the top ten most frequent error types with respect to dependency assignment.

8.3 Errors in argument assignment

In section 5.1.1 we provided an overview of the dependency relations in Talbanken05 and divided these into argument and non-argument relations. The argument relations were either subcategorized for by the verb or thematically entailed by the verb.
We may examine the parse performance for each of the argument relations in terms of the class-based performance measures of precision, recall and F-score, see table 8.2.

It is quite clear that there is a direct relation between the frequency of the dependency relation in the treebank and the parser performance. The most frequent relations are also the relations for which the parser performs best – SS (90.25), SP (84.82), OO (84.53).

Table 8.1 shows the most frequent error types involving argument relations.148 We find frequent error types involving different kinds of subjects (SS, FS, ES), objects (OO, IO) and predicatives (SP). In the following sections we examine these errors in more detail. In order to relate the errors to properties of Scandinavian type languages we briefly examine the realization of these argument relations in Scandinavian in section 8.3.1, supplied with quantitative data from Talbanken05. We then examine the most frequent error types for argument relations in a bit more detail. Table 8.1 shows that regular subjects and direct objects are commonly confused with each other, hence these will be treated together in section 8.3.2. Section 8.3.3 will examine errors involving formal or expletive subjects, as well as the relation of logical subjects, and in sections 8.3.4–8.3.5 we will examine indirect objects and predicative constructions, respectively.

148 Table 8.1 includes both errors where the gold standard relation is an argument relation and/or where the proposed system relation is an argument relation, since both of these will affect the results for the argument relation in terms of precision and recall, respectively.

Gold  System    #        Gold  System    #
ET    OA      450        SS    OO      446
SS    OO      446        OO    SS      309
OA    ET      410        FS    SS      281
AA    RA      404        SS    ROOT    265
AA    OA      398        SP    SS      240
TA    AA      372        SS    DT      238
RA    AA      311        OO    ROOT    221
OO    SS      309        SS    SP      206
RA    OA      308        DT    SS      146
AA    TA      290        SS    CC      137
FS    SS      281        SP    AA      136
OA    RA      270        SS    FS      133
OA    AA      269        OO    PA      126
SS    ROOT    265        OO    AA      103
AA    ET      251        OO    DT       99
SS    FS      133        IO    OO       97
RA    ET      244        ES    OO       95
SP    SS      240        ET    OO       91
SS    DT      238        DT    OO       90
ET    AA      232        ROOT  SS       86
PA    DT      231

Table 8.1: 20 overall most frequent error types (left) and 20 most frequent argument error types (right), where SS=subject, OO=object, AA=other adverbial, OA=object adverbial, ET=nominal post-modifier, RA=spatial adverbial, TA=time adverbial, FS=formal subject, SP=subject predicative, DT=determiner, CC=second conjunct, PA=prepositional complement, IO=indirect object, ES=logical subject.

Deprel                    Gold  Correct  System  Recall  Precision  F-score
SS  subject              19383    17444   19274   90.00      90.51    90.25
SP  subject predicative   5217     4416    5196   84.65      84.99    84.82
OO  direct object        11089     9639   11718   86.92      82.26    84.53
IO  indirect object        424      276     301   65.09      91.69    76.14
AG  passive agent          334      249     343   74.55      72.59    73.56
VO  object inf.            121       84     112   69.42      75.00    72.10
ES  logical subject        878      562     687   64.01      81.80    71.82
FS  formal subject         884      578     737   65.38      78.43    71.31
VS  subject inf.           102       47      58   46.08      81.03    58.75
FO  formal object          156       70      91   44.87      76.92    56.68
OP  object predicative     189       42     112   22.22      37.50    27.91
EO  logical object          22        2       3    9.09      66.67    16.00

Table 8.2: Dependency relation performance: total number of gold instances (Gold), system correct (Correct), system proposed (System), recall, precision and F-score.

8.3.1 Arguments in Scandinavian

In chapter 4, we noted that the expression of grammatical function in Scandinavian is governed largely by linear order in the clause, as expressed through the fields schemas.
The linearizations of grammatical functions in main and subordinate clauses presented in section 4.2.3 are repeated in (133) and (134) below:

(133) Linearization of grammatical functions in declarative, main clauses:
      XP  V_fin  SUBJ  S-ADV  V_non-fin  OBJ_ind  OBJ_dir  ADV

(134) Linearization of grammatical functions in subordinate clauses:
      subj  SUBJ  S-ADV  V_fin  V_non-fin  OBJ_ind  OBJ_dir  ADV

We recall that the initial field is characterized by variation and may be filled by pretty much any constituent (XP), whereas subordinate clauses are non-V2. The interpretation of (133)–(134) is that if core grammatical functions do not occur preverbally, they are predicted to be linearized in this order. As we shall see, however, the argument relations differ with respect to how likely it is that they appear in initial position. As we also recall, morphological marking is not a very reliable indicator of grammatical function, and only a subset of the personal pronouns are marked for case.

Part-of-speech          #      %
PO  pronoun          9549   49.3
N   noun             9178   47.4
V   verb              405    2.1
AJ  adjective         136    0.7
PR  preposition        50    0.3
R   numeral            43    0.2
    other              22    0.1
Total               19383  100.0

Table 8.3: Part-of-speech for subjects (SS) in Talbanken05.

8.3.1.1 Subjects

Subjects in Scandinavian are realized by nominal expressions: various types of noun phrases, as in (135)–(136) where the subject is a noun and pronoun, respectively, or subordinate clauses, as in (137) where the subject is a subordinate clause.
(135) Specialläraren       kan  också  komma  till  klassrummet
      special-teacher-DEF  can  also   come   to    classroom-DEF
      ‘The special education teacher may also come to the classroom’

(136) De    har   alltså  ansvar          och  omsorg  om   barnen
      they  have  so      responsibility  and  care    for  children-DEF
      ‘So, they have the responsibility to care for the children’

(137) Att   värderingarna  förändrats    är  helt     säkert   riktigt
      that  value-PL.DEF   changed-PASS  is  totally  certain  correct
      ‘That the values have changed is almost certainly correct’

Table 8.3 shows the distribution of the various parts-of-speech over the subject relation in the written sections of Talbanken05.[149] We find that pronouns and nouns account for 96.6% of the subjects, and 2.1% of the subjects are subordinate clauses, listed as ‘verb’ in the overview since verbs are clausal heads in the dependency representation. We also observe that almost half of all subjects are expressed by a pronoun, supporting cross-linguistic tendencies in the referentiality of subjects, noted in section 3.4.

[149] Since verbs are the heads of clauses in dependency grammar, clausal arguments are represented as ‘verb’ in the overviews over parts-of-speech for the different argument relations, e.g., table 8.3.

          Before          After           Total
Deprel        #     %         #     %         #      %
SS        14958  77.2      4425  22.8     19383  100.0
FS          609  68.9       275  31.1       884  100.0
ES            8   0.9       870  99.1       878  100.0
OO          611   5.5     10478  94.5     11089  100.0
IO            5   1.2       419  98.8       424  100.0
SP          491   9.4      4726  90.6      5217  100.0

Table 8.4: Ordering relative to verb for argument relations in Talbanken05.

The linearization of grammatical functions in main clauses places the subject in the Midfield, directly following the finite verb. In table 8.4, we find an overview of the linear position of various arguments with respect to the head verb. For subjects, the head verb is always the finite verb, and we find that only 22.8% of the subjects follow the finite verb (after).
These counts include both subjects of main and subordinate clauses and if we restrict our counts to subjects in main clauses, we find slightly more variation; 35.7% of the main clause subjects occur following the finite verb. In chapter 4, we noted that the fields schema does not directly express the tendency for subjects to occur preverbally. Regardless of the status of the clause, this tendency is certainly supported by the data: 77.2% of the total subjects occur preverbally, and 64.3% of the main clause subjects.

Formal subjects

Formal or expletive subjects are characterized by a lack of semantic content. They occupy a subject position, but do not share thematic properties with regular subjects. The Scandinavian languages, like English and the other Germanic languages, enforce a subject requirement, also known as the Extended Projection Principle (Chomsky 1981) and the Subject Condition (Baker 1983), which requires that all declarative main clauses must contain a subject. As a consequence, a formal subject is employed when a thematic subject for various reasons may not occupy a subject position.[150]

[150] Note that the reasons for using a formal subject vary; the thematic subject may be demoted, as in impersonal passives, or prefer a postposed position due to discourse-oriented constraints, as in presentational constructions.

We may discern six general types of constructions where an expletive subject figures in Scandinavian (Teleman, Hellberg and Andersson 1999):

1. Existential/presentational constructions:

   (138) Det  finns   olika      slags  barnhem
         it   exists  different  sorts  orphanages
         ‘There are different kinds of orphanages’

2. Det with oblique subject:

   (139) För  somliga  räcker         det  med   bröstet     eller  flaskan
         for  some     is-sufficient  it   with  breast-DEF  or     bottle-DEF
         ‘For some it is sufficient with breastfeeding or the bottle’

3. Extraposed finite or non-finite subordinate clause:

   (140) Det  är  ytterst    lätt  att  gå  ur      kyrkan
         it   is  extremely  easy  to   go  out-of  church-DEF
         ‘It is very easy to leave the church’

4. Impersonal passive:

   (141) Det  syndas    ofta   utan     tvekan  . . .
         it   sin-PASS  often  without  doubt   . . .
         ‘There is often sinning going on, without doubt . . . ’

5. Weather-verbs or verbs of perception denote an event or state with no clear agent:[151]

   (142) Det  regnar/åskar/snöar
         it   rains/thunders/snows
         ‘It rains/thunders/snows’

   (143) Nu   luktar  det  torkad  frukt  i   källaren
         now  smells  it   dried   fruit  in  cellar-DEF
         ‘It smells of dried fruit in the cellar’

6. Cleft constructions:

   (144) Det  är  hon  som  svarar   för  de   inre      relationerna
         it   is  she  who  answers  for  the  internal  relation-PL.DEF
         ‘It is she who is responsible for the internal relations’

[151] The examples in (142)–(143) are from Teleman, Hellberg and Andersson 1999.

Part-of-speech        #      %
V   verb            492   56.0
N   noun            329   37.5
PO  pronoun          47    5.4
AJ  adjective         7    0.8
PR  preposition       2    0.2
P   participle        1    0.1
Total               878  100.0

Table 8.5: Part-of-speech for logical subjects (ES) in Talbanken05.

Formal subjects are distinguished from regular subjects in Talbanken05 and assigned a separate dependency relation (FS). The category of formal subject is employed only in cases where there is also an expressed logical subject (ES) and is exclusively realized as the impersonal 3rd person pronoun det ‘it’.[152] The logical subject may be realized by a nominal element or a subordinate clause. When it is nominal it is normally realized as an object and is in complementary distribution with regular objects in existentials and impersonal passives of transitive verbs. As table 8.5 shows, subordinate clauses are the most common logical subjects (56%), followed by nouns (37.5%) and pronouns (5.4%).
It is interesting to note that the distribution of part-of-speech over the logical subject function is very similar to that of regular objects, see table 8.6, in contrast to regular subjects, cf. table 8.3.

The main criteria for the assignment of the FS and ES dependency relations may be summarized as follows:

Expressed logical subject: the categories of formal (FS) and logical subject (ES) entail each other (bidirectionally); annotation for FS only when there is an expressed ES.

Replacement: replacement of the formal subject with the logical subject should result in a grammatical sentence.

As a consequence of the above criteria, not all of the construction types listed above are annotated as involving a formal subject. Talbanken05 identifies types 1, 3 and some cases of 4 as formal subjects (FS). The first criterion, demanding an expressed logical subject, excludes impersonal passives of intransitive verbs, as well as the weather-type verbs of type 5.[153] The second criterion excludes expletive subjects expressed as obliques (type 2) and cleft constructions (type 6). These subjects are annotated as regular subjects (SS), rather than formal subjects. The constructions analyzed as containing a formal subject (FS), and consequently also a logical subject (ES), are thus the existentials (type 1), the extrapositions (type 3) and impersonal passives of transitive/ditransitive verbs (type 4).[154]

[152] We find a total of 884 formal subjects in the written sections of Talbanken and 99.6% of these are realized as det ‘it’. The remaining three are verbal and must be attributed to annotation mistakes.

8.3.1.2 Objects

In section 3.1, we introduced the notion of subjects and objects as the core arguments of a clause, denoting its main participants.
Objects are also nominal constituents of the verb and may be headed by pronouns and nouns, as in (145), as well as subordinate clauses, as in (146):

(145) Men  människorna  vill  ha    mera  bostäder
      but  people-DEF   want  have  more  housing
      ‘But the people demand more housing’

(146) Vi  måste  kolla  om       dubbelröstning  skett
      we  must   check  whether  double-voting   happened
      ‘We have to check whether or not double-voting has taken place’

Table 8.6 shows the distribution of the various parts-of-speech over the direct object relation (OO) in Talbanken05. We find a clear difference from the subject relation in table 8.3, where the proportion of subordinate clauses (verb) constitutes the most striking difference. These account for 18.3% of the direct objects, but only 2.1% of the subjects. Note however, that clausal objects include a wide category of syntactic constituents in the Talbanken05 annotation scheme, including infinitival complements of control and raising verbs.[155] We furthermore observe that direct objects are clearly less commonly expressed pronominally (20.3%) than subjects (49.3%). As indicated by the linearization of the fields analysis in (133)–(134), objects are positioned in the End field, following any finite and non-finite verbs. Objects are often claimed to be in a closer structural relation with the verb than the subject and depending on the valency of the lexical verb, there may be one or two objects – a direct (OO) and an indirect object (IO). The frequencies of ordering with respect to the lexical verb in table 8.4 above clearly show the strong preference for postverbal position in the case of both types of objects. We find that 97.3% of all direct objects and 98.8% of the indirect objects are postverbal (after).

Indirect objects are in general much less frequent than direct objects and in Talbanken05 there are only a total of 424 instances. These show a strong preference for pronominal realization, as we see from table 8.7. In section 7.1.2 we also noted that indirect objects show a general preference for animate denotation. As we also saw in chapter 7, these properties are not unrelated, since animate reference is typically expressed pronominally.

[153] The annotation manual mentions a few exceptions to the criterion demanding an expressed logical subject. In sentences where the subject is the adverbial pronoun här ‘here’, we may get a logical subject analysis without a formal subject (Teleman 1974: 46). In sentences where a clause containing a formal subject is analyzed as modifying the logical subject clause, the subordinate formal subject does not have a corresponding logical subject (Teleman 1974: 46). As a consequence, we find that the treebank contains slightly different numbers of elements annotated as formal (FS) and logical subject (ES) – 884 vs. 878 instances, respectively.

[154] We may note that the replacement described in the replacement criterion above can be related to the argument status of the subject and is not randomly chosen. Neither is the group of remaining formal subject constructions without certain commonalities. It has been argued in several places in the literature that the subjects of weather verbs are quasi-arguments which therefore cannot readily be replaced with another argument (Chomsky 1981; Falk 1993). The exclusion of subjects of impersonal passives over intransitive verbs from the group of formal subjects, however, is unfortunate. Clearly, these should also be expletives on a par with their transitively formed counterparts.

Part-of-speech          #      %
N   noun             6377   57.5
PO  pronoun          2247   20.3
V   verb             2034   18.3
AJ  adjective         282    2.5
PR  preposition        60    0.5
R   numeral            34    0.3
P   participle         30    0.3
AB  adverb             15    0.1
    other              10    0.1
Total               11089  100.0

Table 8.6: Part-of-speech for direct objects (OO) in Talbanken05.
There is, however, a complicating factor in the annotation of indirect objects; the reflexive argument of transitive reflexive verbs is annotated as an indirect object, as in (147) below.

(147) Barnens       hus    tänker  jag  mig     hellre  av  modell  äldre  villa
      children-GEN  house  think   I    myself  rather  of  model   older  villa
      ‘The children’s house I rather imagine as an older villa’

In these types of constructions, the pronominal argument may not be anything but a reflexive pronoun, coreferent with the subject. The reflexive pronoun accounts for 44.4% of all the indirect object instances in Talbanken, which shows that indirect objects are in fact even more infrequent than first assumed.

[155] The annotation manual proposes a replacement test for objecthood of subordinate clauses: if the clause can be replaced by a pronoun, e.g. något ‘something’ or detta ‘this’, it is annotated as object.

Part-of-speech        #      %
PO  pronoun         320   75.5
N   noun             98   23.1
AJ  adjective         5    1.2
ID  idiom             1    0.2
Total               424  100.0

Table 8.7: Part-of-speech for indirect objects (IO) in Talbanken05.

8.3.1.3 Predicatives

Predicatives establish a core argument – a subject or object – as having a certain property or being a member of a certain class of referents, or they establish referential identity. In Scandinavian, predicatives are largely realized by adjectives, participles and nominals (Teleman 1974), as illustrated by the overview of part-of-speech for subject predicatives in table 8.8. The adjectival predicatives agree with the predicated argument in gender, definiteness and number. Subject predicatives are complements of a small set of verbs – vara ‘be’, bliva ‘become’, varda ‘become’, heta ‘be-named’, kallas ‘be-called’, förefalla ‘seem’, verka ‘seem’, se . . . ut ‘look’ – and typically occupy the object position, with which they are in complementary distribution, following the finite and possibly non-finite verb(s).
(148) Deras  val             av  äktenskapspartner  blev    kanske    slumpmässigt
      their  choice-SG.NEUT  of  marriage-partner   became  possibly  random-SG.NEUT
      ‘Their choice of partner was possibly random’

(149) Per-Ola  Larsson  som  är  sekreterare  i   organisationen    . . .
      Per-Ola  Larsson  who  is  secretary    in  organization-DEF  . . .
      ‘Per-Ola Larsson who is the secretary in the organization . . . ’

(150) Detta  är  EEC-organisationen    i   dag
      this   is  EEC-organization-DEF  to  day
      ‘This is the EEC organization at present’

We may distinguish semantically and referentially between descriptive and identifying predicatives (Teleman, Hellberg and Andersson 1999), where the former classify or characterize the predicated argument – either subject or object – further, as in (149), whereas the latter establish a relation of strict coreference, as in (150), i.e. the extensions of the two arguments are identical. Identifying predicatives occur only with the copula verbs vara, bli, förbli ‘be, become, remain’, and the nominal predicative is typically definite, as in (150). Descriptive predicatives in contrast, are usually indefinite, and may occur without indefinite article, as in (149) above. The use of an evaluative adjective is typical, where the noun is commonly a hypernym of the predicational argument and the main semantic contribution of the predicative is in the information expressed by the adjective. In the treebank data, we find that the nominal subject predicatives exhibit a preference for indefinite expression; 89.6% of the subject predicatives expressed as nouns are indefinite.[156] This indicates that descriptive predicatives are most common in Swedish.
Talbanken05 distinguishes subject predicatives (SP), exemplified by (148)–(150) above, and object predicatives (OP), exemplified by (151):

(151) Lämna  aldrig  spädbarn   ensamma   hemma
      leave  never   infant-PL  alone-PL  home
      ‘Never leave an infant home alone’

There are 5223 subject predicatives in Talbanken05 and these are distributed across the various parts-of-speech as shown in table 8.8. The object predicatives are highly infrequent and there are only 190 instances in the treebank. These show similar distributional properties with respect to part-of-speech as the subject predicatives.[157]

With respect to word order placement, subject predicatives are in the postverbal position in a clear majority of cases (90.6%), see table 8.4 above. We also observe variation in position for this relation and 9.4% of the subject predicatives are located preverbally.[158] Object predicatives, on the other hand, are almost exclusively postverbal (99.0%).

[156] Out of all nominal (noun or pronoun) subject predicatives, 77.6% are indefinite. We then count all pronouns as definite.

[157] We find that 41.6% of the object predicatives are adjectives, 28.9% are nouns and 16.8% participles. One difference is in the fact that there are no pronominal object predicatives in the data set.

Part-of-speech         #      %
AJ  adjective       2262   43.4
N   noun            1801   34.5
P   participle       572   11.0
PO  pronoun          280    5.4
PR  preposition      149    2.9
V   verb              85    1.6
AB  adverb            46    0.9
R   numeral           20    0.4
    other              2    0.0
Total               5217  100.0

Table 8.8: Part-of-speech for subject predicatives (SP) in Talbanken05.

8.3.2 Subject and direct object errors

We noted above that the two most frequent error types involving argument relations were errors analyzing subjects as objects (SS_OO) and vice versa (OO_SS). Table 8.9 shows an overview of the main error types in the assignment of the subject and direct object dependency relations.
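As a minimal sketch of how error-type overviews of this kind and the class-based measures of table 8.2 can be derived, the following Python fragment tallies (gold, system) label pairs and computes precision, recall and F-score for one relation. The function names and the toy label lists are hypothetical and not taken from the thesis. The sketch also assumes that only dependency labels are compared, whereas the error types reported here additionally take head attachment into account (cf. the SS_SS type in table 8.9).

```python
from collections import Counter

def error_types(gold, system):
    """Tally (gold, system) label pairs for tokens where the two labels differ."""
    return Counter((g, s) for g, s in zip(gold, system) if g != s)

def scores_from_counts(n_gold, n_correct, n_system):
    """Return (precision, recall, F-score) as fractions from three counts."""
    precision = n_correct / n_system if n_system else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def relation_scores(gold, system, label):
    """Class-based scores for one dependency label over aligned label lists."""
    n_gold = sum(g == label for g in gold)
    n_system = sum(s == label for s in system)
    n_correct = sum(g == s == label for g, s in zip(gold, system))
    return scores_from_counts(n_gold, n_correct, n_system)

# Toy label sequences, invented for illustration (not treebank data).
gold = ["SS", "OO", "SS", "SP", "OO", "SS"]
system = ["SS", "SS", "OO", "SP", "OO", "SS"]
confusions = error_types(gold, system)        # one OO_SS and one SS_OO error
ss_scores = relation_scores(gold, system, "SS")

# Counts for the SS relation as reported in table 8.2 (Gold, Correct, System).
table_ss = scores_from_counts(19383, 17444, 19274)
```

Applied to the counts reported for SS in table 8.2, `scores_from_counts` gives the same precision, recall and F-score as the table once the fractions are expressed as percentages with two decimals.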
In addition to the confusion of subjects and objects, which constitutes the most common error type for both relations, we find that both subjects and objects are quite commonly assigned status as the root of the dependency graph (ROOT).[159] For both argument relations we also observe error types indicating confusion with other argument relations. For subjects we observe confusion with the other main argument functions, such as subject predicatives (SP) and expletive subjects (FS), as well as confusion with determiners (DT) and prepositional complements (PA). For objects we observe primarily confusion with various adverbial relations (AA, ET, OA), as well as confusion with prepositional complements (PA) and determiners (DT).

[158] This figure may be compared to direct objects which are only found preverbally in 2.7% of the cases in Talbanken.

[159] The root relation is the default head for all nodes in the dependency graph since the graph is initialized with all nodes attached to the root to ensure connectedness. The dependency graph is thus not guaranteed to be a tree in the technical sense, but is always a set of subtrees attached to the artificial root. These are thus errors where the parser has not located a more appropriate attachment and label. Nominal elements may very well be attached to the root, for instance in sentence fragments which lack a finite verb, but for the error types involving erroneous assignment to the root, this is clearly not the case.

Error types for subjects (SS)    Error types for objects (OO)
Gold  Sys      #                 Gold  Sys      #
SS    OO     446                 OO    SS     309
SS    ROOT   265                 OO    ROOT   221
SS    DT     238                 OO    OO     149
SS    SS     216                 OO    PA     126
SS    SP     206                 OO    AA     103
SS    CC     137                 OO    DT      99
SS    FS     133                 OO    ET      58
SS    PA      53                 OO    OA      57
. . .                            . . .

Table 8.9: Error types for subjects (left) and objects (right).
We may note that confusion with DT and PA indicates that a phrasal reading rather than a clausal one has been chosen. Since there is no explicit notion of phrases in a dependency analysis, these errors in dependency labeling are primarily errors also in head attachment, as opposed to the errors confusing argument relations. For the SS_OO and OO_SS errors only 19.5% and 17.8% involve incorrect head assignments, whereas the corresponding proportions for the SS_DT and OO_PA errors are 92.9% and 100%, respectively.

There are various sources of errors in subject/object assignment. Common to all of them is that the parts of speech that realize subjects and objects are compatible with a range of dependency relations. Pronouns, for instance, may function as subjects, objects, determiners, predicatives, conjuncts, prepositional objects, etc. In addition, we find “traditional” attachment ambiguity errors, for instance in connection with coordination, subordination, particle verbs, etc. These represent notorious phenomena in parsing, and are by no means particular to Swedish. This language, however, in addition exhibits ambiguities in morphology and word order which complicate the picture further.

The confusion of subjects and objects follows from lack of sufficient formal disambiguation, i.e., simple clues such as word order, part-of-speech and word form do not clearly indicate syntactic function. The reason for this can be found in ambiguities on several levels.

With respect to word order, we have seen that subjects and objects may both precede or follow their verbal head, but these realizations are not equally likely. Subjects are more likely to occur preverbally, whereas objects typically occupy a postverbal position.

               Before        After         Total
Gold  System     #     %       #     %       #      %
SS    OO       103  23.1     343  76.9     446  100.0
OO    SS       103  33.3     206  66.7     309  100.0

Table 8.10: Ordering relative to verb for the SS_OO and OO_SS error types.
Based only on the word order preferences discussed above, we would expect postverbal subjects and preverbal objects to be more dominant among the errors than in the treebank as a whole (23% and 6% respectively), since they display word order variants that depart from the canonical, hence most frequent, ordering of arguments. This is precisely what we find. Table 8.10 shows a breakdown of the errors for confused subjects and objects and their position with respect to the verbal head. We find that postverbal subjects (after) are in clear majority among the subjects erroneously assigned the object relation. Due to the V2 property of Swedish, the subject must reside in a position following the finite verb whenever another constituent occupies the preverbal position, as in (152) where a direct object resides sentence-initially or (153) where we find a sentence-initial adverbial:

(152) Samma  erfarenhet  gjorde  engelsmännen
      same   experience  made    englishmen-DEF
      ‘The same experience, the Englishmen had’

(153) År    1920,  och  först  då,    fick  den  gifta    kvinnan    fullständig  myndighet
      Year  1920,  and  first  then,  got   the  married  woman-DEF  complete     rights
      ‘It was not until 1920 that the married woman received full civil rights’

Whereas the postverbal subjects are in a non-canonical position, the preverbal subjects should be easier to locate since their structural position is a strong indicator for subjecthood. As table 8.10 shows, preverbal subjects are also in minority among the errors. However, when preverbal subjects are mistakenly assigned the object function, it is typically in cases where the parser has not been able to determine their clause-initial position.
In subordinate clauses without complementizers, as in (154), this error is indicated also through erroneous head assignment to the matrix verb instead of the following head verb:

(154) På  denna  grund   tycker  jag  ett  äktenskap  ska     byggas
      on  this   ground  think   I    a    marriage   should  build-PASS
      ‘On these foundations I think that a marriage should be built’

For the confused objects we find a larger proportion of preverbal elements than for subjects, which is the mirror image of the normal distribution of syntactic functions among preverbal elements. As table 8.10 shows, the proportion of preverbal elements among the subject-assigned objects (33.3%) is notably higher than in the corpus as a whole, where preverbal objects account for a minuscule 6% of all objects. The preverbal objects are topicalized elements which precede their head verb, which may be either the matrix verb, as in (155)–(157), or the verb of a following subordinate clause, as the relative clause in (158) below:[160]

(155) Detta  anser  tydligen    inte  Stig  Hellsten
      this   means  apparently  not   Stig  Hellsten
      ‘This, Stig Hellsten apparently does not believe’

(156) Vilken  uppfattning  har  mannen   om     kvinnans       ‘rätta  plats’  i   hemmet?
      which   opinion      has  man-DEF  about  woman-DEF.GEN  ‘right  place’  in  home-DEF?
      ‘Which opinion does the man have about the woman’s place in the home?’

(157) Kärlekens     innersta  väsen   lär    inte  något  politiskt  parti  kunna    påverka
      love-DEF.GEN  inner     nature  seems  not   any    political  party  can-INF  influence
      ‘The inner nature of love, it seems that no political party can influence’

(158) Vad   Hellsten  uppfattar   som  något      tryggt  och  fast,  blir     . . .
      what  Hellsten  interprets  as   something  safe    and  firm,  becomes  . . .
      ‘What Hellsten interprets as something safe and firm, becomes . . . ’

Contrary to our initial hypothesis, however, we find a majority of postverbal objects among the objects confused for subjects.
These objects are interpreted as subjects because the local preverbal context strongly indicates a subject analysis. This includes verb-initial clauses as in (159) where we find a clause-initial imperative or cases of VP coordination, as in (160), as well as constructions where the immediate preverbal context consists of an adverbial and the subject is non-local, as in (161) and (162) below.

[160] Note that raising verbs, like lär in example (157), are analyzed as normal auxiliary verbs, hence the topicalized objects are annotated as dependents of these.

(159) Glöm    aldrig  det   löfte    om  trohet        för  livet
      forget  never   that  promise  of  faithfulness  for  life-DEF
      ‘Never forget that promise of faithfulness for life’

(160) . . .  om  man  tidigare  varit  gul     i   ögonen,    haft  gulsot    eller  . . .
      . . .  if  one  earlier   been   yellow  in  eyes-DEF,  had   jaundice  or     . . .
      ‘. . . if one earlier has had yellow eyes, had jaundice or . . . ’

(161) Ungdomarna  blir    med   barn   och  det  sociala  trycket       nästan  tvingar  dem   att  gifta  sig
      teenagers   become  with  child  and  the  social   pressure-DEF  almost  forces   them  to   marry  themselves
      ‘The teenagers become pregnant and social pressure almost forces them to get married’

(162) Eftersom  man  har  full  frihet   att  enkelt  och  snabbt   ingå   äktenskap
      because   one  has  full  freedom  to   easily  and  quickly  enter  marriage
      ‘Because one has the freedom to easily and quickly get married’

The example in (161) is particularly interesting as it violates the V2-property, assumed to be a categorical constraint of Swedish. We may note that the examples in (159)–(162) above indicate acquisition of argument ordering resulting from the V2 requirement; when there is no preverbal argument or when the preverbal argument is not a good subject candidate, the argument following the verb is analyzed as subject.
Recall, however, that the parser does not have information on tense or finiteness, hence overgeneralizes to examples like (162), where the verb is non-finite.

In addition to the word order variation discussed above, Swedish also has limited morphological marking of syntactic function, see section 4.1. Recall that nouns are only marked for genitive case and only pronouns are marked for accusative case. There is also syncretism in the pronominal paradigm. There are pronouns which are invariant for case, e.g. det, den ‘it’, ingen/inga ‘no’, and furthermore may function as determiners. This means that with respect to word form, only the set of unambiguous pronouns clearly indicate syntactic function. We may predict that subject/object confusion errors frequently exhibit elements whose syntactic category and/or lexical form does not disambiguate, i.e., nouns or ambiguous pronouns. Table 8.11 shows the distribution of nouns, functionally ambiguous and unambiguous pronouns and other parts of speech for confused subjects/objects.[161] Indeed, we find that nouns and functionally ambiguous pronouns dominate the errors where subjects and objects are confused. Since case information is not explicitly represented in the input, this indicates that case is acquired quite reliably through lexical form.[162]

Gold  System   Noun         Pro_amb      Pro_unamb    Other        Total
SS    OO       324  72.6%   53  11.9%    29  6.5%     40  9.0%     446  100%
OO    SS       215  69.6%   74  23.9%     9  2.9%     11  3.6%     309  100%

Table 8.11: Part of speech for the SS_OO and OO_SS error types – nouns, ambiguous pronouns, unambiguous pronouns and other parts of speech.

The fact that we find a higher proportion of ambiguous pronouns among the objects erroneously assigned subject status indicates that the parser has acquired a preference for subject assignment of pronouns compatible with the difference in frequency for pronominal realization (SS_pro 49.3%, OO_pro 20.3%[163]).
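The breakdown by formal class used in this comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set `CASE_MARKED` is a small, incomplete list of case-marked personal pronoun forms chosen here for the example and is not the inventory used in the thesis, and the function names are hypothetical. The part-of-speech tags N and PO follow the tag names used in the tables of this section.

```python
from collections import Counter

# Illustrative, incomplete set of case-marked personal pronoun forms.
# Assumption for this sketch only; the thesis does not list its inventory here.
CASE_MARKED = {"jag", "mig", "du", "dig", "han", "honom",
               "hon", "henne", "vi", "oss", "dem"}

def form_class(form, pos):
    """Map a token to noun, ambiguous pronoun, unambiguous pronoun or other."""
    if pos == "N":
        return "noun"
    if pos == "PO":
        return "pro_unamb" if form.lower() in CASE_MARKED else "pro_amb"
    return "other"

def shares(counts):
    """Convert absolute counts to percentages with one decimal."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 1) for key, value in counts.items()}

# Toy tokens (form, part-of-speech), invented for illustration.
tokens = [("barnen", "N"), ("det", "PO"), ("ingen", "PO"),
          ("honom", "PO"), ("att", "V")]
toy_counts = Counter(form_class(form, pos) for form, pos in tokens)

# Absolute counts for the SS_OO error type as reported in table 8.11.
ss_oo = shares({"noun": 324, "pro_amb": 53, "pro_unamb": 29, "other": 40})
```

With the counts from the SS_OO row of table 8.11 as input, `shares` yields the percentages given in that row.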
As discussed earlier, not all of the objects erroneously assigned subject status are in preverbal position. In fact, a majority (66.7%, see table 8.10) are still in postverbal position. This is in part due to another type of ambiguity in terms of syntactic category; both nouns and a subset of the pronouns may function as determiners (DT). In the postverbal position, word order demands (V2) create clusters of arguments where both a phrasal and a clausal interpretation is possible, i.e. the nouns are erroneously analyzed as modifiers instead of phrasal heads. In (163)–(164) we see examples which illustrate the phenomenon. The subject, expressed either as a noun or an (ambiguous) pronoun, is parsed as a determiner of the following noun and the following argument as the subject of the verb.

(163) Naturligtvis  knyter    barnen    kontakter  utanför  hemmet
      naturally     attaches  children  contacts   outside  home-DEF
      ‘Naturally, the children bond outside the home’

(164) I   ett  äktenskap  har  ingen  äganderätt  över  den  andre
      in  a    marriage   has  no     ownership   over  the  other
      ‘In a marriage, nobody has ownership over the other’

[161] The ‘other’ category consists mainly of verbs (heads of subordinate clauses), adjectives, participles and numerals functioning as nominal heads.

[162] Since pronouns are a closed class and it is mainly the set of personal pronouns that are marked for case, we would assume that this property can be acquired reliably without large amounts of training data.

[163] The proportion of pronouns is higher for both subjects and objects if we normalize over only nominal instances, i.e. excluding subordinate clauses: SS_pro 50.6%, OO_pro 23.7%. Even so, there is a clear difference between the two relations in terms of pronominal expression.

Error types for formal subjects (FS)
Gold  System    #
FS    SS      281
FS    DT       10
FS    OO        6
FS    ROOT      3
FS    FO        1
FS    KO        1

Table 8.12: Error types for formal subjects.
The initial error analysis shows that the confusion of subjects and objects constitutes a frequent and consistent error during parsing. It is caused by ambiguities in word order and morphological marking and we find cases that deviate from the most frequent word order patterns and are not formally disambiguated by part-of-speech information. In order to resolve these ambiguities, we have to examine features beyond part-of-speech category and linear word order.

8.3.3 Formal subject errors

Table 8.12 presents the errors in dependency relation assignment performed by the parser for the relation of formal subject. We see that the confusion is almost exclusively with that of the regular subject function (SS). The errors for the function of logical subject, shown in table 8.13, vary over the functions compatible with a postverbal realization (objects (OO), object adverbial (OA), subject predicatives (SP) etc.). This is to be expected since the logical subject may be realized by both nominal phrases and clausal elements.

The confusion between regular and formal subjects is clearly caused by the fact that they may be realized by the same word form and may occupy the same structural positions (both pre- and postverbally). After all, formal subjects are subjects in all structural respects.¹⁶⁴ The impersonal third person pronoun det ‘it’ is maximally ambiguous and may occupy a wide range of dependency relations, as table 8.14 illustrates. It occurs in all major argument relations and also occurs frequently as a definite determiner. The pronoun is not case-marked and hence is formally invariant in all the relations exemplified above.

¹⁶⁴ One might argue that distinguishing formal subjects from other subjects is unduly complicating. It has been shown, however, that whereas more fine-grained dependency labels may affect parsing accuracy negatively, they improve semantic analysis, such as semantic role labeling (Johansson and Nugues 2007). To the extent that syntactic parsing is not simply a goal in itself, it seems that a more fine-grained analysis is worthwhile, and, in particular, with respect to phenomena like expletive categories which clearly have profound effects on the semantic interpretation of the arguments.

Table 8.13: Error types for logical subjects (ES).

Gold  System    #
ES    OO       95
ES    OA       55
ES    SP       26
ES    ET       23
ES    KA       20
ES    SS       19
ES    ROOT     19
ES    PA       15
ES    AA       15
ES    CC        5

Table 8.14: Dependency relations for the 3rd person pronoun det ‘it’ in Talbanken05.

Deprel                        Abs      %
SS (regular subject)         1305   37.5
FS (formal subject)           881   25.3
DT (determiner)               791   22.7
OO (direct object)            226    6.5
HD (head of idiom)             86    2.5
PA (prep. complement)          79    2.3
SP (subject predicative)       33    0.9
. . .                       . . .  . . .
Total                        3480  100.0

Since formal subjects are structural subjects it is not possible to differentiate the two categories based on structural properties of word order. A relevant question then relates to how the parser manages to recognize any formal subjects at all. We may examine more closely the errors performed by the parser and also the instances which were parsed correctly. Since the form does not vary it is clear that the analysis of the pronoun as formal or regular subject will be largely dependent on the relation with and properties of the verbal head.
We saw earlier that position relative to the verbal head proved to be a contributing factor in the error analysis of regular subjects. As table 8.15 shows, however, there are no clear differences between correctly located formal subjects and ones that were not located by the parser or elements which were erroneously assigned the FS relation. These error sets exhibit similar distributions with regard to ordering with respect to the verbal head (before/after) as well as distance (difference between immediately preceding/following and the total before/after bins).

Table 8.15: Ordering relative to verb for formal subjects: correctly located (FS_FS) and errors – all (FS_¬FS) and confusion with subject relation (FS_SS).

                 Before         After          Total
Gold  System     #      %       #      %       #      %
FS    FS       401   68.9     181   31.1     582  100.0
FS    ¬FS      208   68.8      94   31.1     302  100.0
FS    SS       201   71.5      80   28.5     281  100.0
¬FS   FS       100   63.7      57   36.3     157  100.0

The occurrence of a formal subject is very much dependent on properties of the predicate and often reflects the argument structure of the verb. With no other formal or structural clues, we must assume that the interpretation of det ‘it’ as either a formal or a regular subject relies heavily on the lexical form of the verb. We may in addition predict that the correctly located formal subjects will be arguments of a smaller group of verbs which are frequently found with a formal subject, whereas the set of errors will exhibit a more heterogeneous group of verbal heads. Table 8.16 compares the ten most frequent head verbs for the correctly located formal subjects with those of the errors.¹⁶⁵ We find that 37.3% of the correctly analyzed formal subjects are actually arguments of the existential predicate finns ‘exists’, as in (138) above. Overall, the set of head verbs for the correct formal subjects is smaller and we find on average 5 instances per head verb type, whereas the corresponding figure for the errors is 2.3.
This indicates that the parser acquires lexical generalizations regarding these verbs and their argument structure. We may also note that the percentage of hapax legomena in the set of head verbs for the errors is 30%, but only 11% for the correctly recognized formal subjects – another observation which adds to the difficulty of correct analysis based on frequency for these arguments.

¹⁶⁵ When the verb is the copula är ‘is’, the subject predicate has been included to form a complex predicate of the type vara_svårt ‘be_difficult’.

Table 8.16: 10 most frequent finite head verbs for the formal subjects in Talbanken05 – correctly located by the parser (left) and errors (right).

Head verbs – CORRECT                 Head verbs – ERRORS
finns ‘exists’                217    kan ‘can’                 19
står ‘stands’                  29    har ‘have’                15
vara_svårt ‘be_difficult’      16    blir ‘becomes’            12
vara_viktigt ‘be_important’    16    kommer ‘comes’            10
har ‘have’                     13    måste ‘must’               9
går ‘goes’                     11    skulle ‘should’            9
fordras ‘?’                     9    går ‘goes’                 7
vara_klart ‘be_clear’           9    vara_plikt ‘be_duty’       6
kan ‘can’                       8    ska ‘shall’                5
måste ‘must’                    8    skall ‘shall’              5
. . .                                . . .

The list of head verbs for the correct subjects consists largely of verbs which typically take a formal subject – existential predicates finns ‘exists’, står ‘stands’, complex copular predicates vara_svårt ‘be_difficult’, vara_viktigt ‘be_important’ etc., whereas the list of head verbs for the errors consists to a large part of functional verbs – auxiliaries (har ‘have’, blir ‘becomes’) and modals (måste ‘must’). In fact, only 9.6% of the head verbs for the correctly assigned subjects are functional verbs, whereas 30% of the head verbs for the errors are modal or temporal auxiliaries. This indicates that our earlier comments on distance to the verbal head are somewhat diffused.
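The two measures of lexical heterogeneity used in this comparison, the average number of instances per head verb type and the share of hapax legomena, can be sketched as follows. This is only an illustration under stated assumptions: the hapax share is computed over verb types (the text does not say whether types or tokens are meant), the name `head_verb_profile` is hypothetical, and the verb lists are invented toy data rather than the Talbanken05 counts.

```python
from collections import Counter


def head_verb_profile(head_verbs):
    """Return (mean instances per verb type, share of types occurring once)."""
    counts = Counter(head_verbs)
    if not counts:
        raise ValueError("head_verbs must not be empty")
    n_types = len(counts)
    mean_per_type = len(head_verbs) / n_types
    # Assumption: hapax legomena are counted as a proportion of verb types.
    hapax_share = sum(1 for c in counts.values() if c == 1) / n_types
    return mean_per_type, hapax_share


# Invented toy data: head verbs of correctly and erroneously parsed arguments.
correct_heads = ["finns"] * 6 + ["står"] * 2 + ["går"]
error_heads = ["kan", "kan", "har", "blir", "kommer"]

print(head_verb_profile(correct_heads))
print(head_verb_profile(error_heads))
```

A set dominated by a few frequent verbs gives a high mean and a low hapax share, which is the pattern the text reports for the correctly recognized formal subjects.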
Although the distance to the finite verb serving as head for the subject may not be large, the distance to the lexical head verb indicating its argument status is on average longer for the errors than the correctly identified formal subjects.

The following picture of the difficulties in assigning the formal subject relation emerges: with no formal or structural clues available, the correct identification relies on lexical information regarding the head verb. In order for this information to be employed during parsing, it must represent a reliable source of information – being frequent, fairly unambiguous and available at attachment time.¹⁶⁶ It is also clear that the analysis of the logical subject relation (ES) relies largely on the correct analysis of the formal subject. Further confusion with the object and object adverbial functions is therefore to be expected as a result of error propagation.

¹⁶⁶ Recall that depending on the presence or absence of a logical subject (ES), the subject of one and the same predicate may be annotated as formal (FS) or regular (SS), respectively.

8.3.4 Indirect object errors

Table 8.17: Error types for indirect objects (IO).

Gold  System    #
IO    OO       97
IO    DT       31
IO    PA        6
IO    ROOT      3
IO    SS        2
IO    AT        2
IO    HD        1
IO    AA        1
IO    SP        1

The general trend in the parse results for indirect objects is that precision is high (93%), whereas recall is considerably lower (66%). This means that the parser has difficulties locating candidate indirect objects in general, and chooses an indirect object analysis only if there is clear evidence for it. This evidence, we may assume, is given in part by morphology. It is not surprising that indirect objects should be difficult to locate, as they occupy the postverbal position examined earlier, hence may be confused with both subjects and objects.
However, person-denoting indirect objects are marked by accusative case when expressed pronominally, a property which clearly sets them apart from subjects. Differentiation from direct objects, however, is more difficult, as table 8.17 clearly illustrates. We find that confusion with the direct object relation (OO) is the most frequent error type for indirect objects.

If we examine the instances which are correctly recognized as indirect objects by the parser, we find that the majority of these consist of case-marked pronouns (83%) which to a large part are reflexive pronouns, see 8.3.1 above. With respect to word order, these sentences display the canonical ordering of arguments shown in (133)–(134) in section 8.3.1, to a large extent (87.5%). In these, we find either a preverbal subject or no realized subject at all in the case of subordinate clauses. The direct objects in the set of correct instances are most often realized by noun phrases, hence differ from the indirect object in this respect.

Although both the factors of word order and part-of-speech can contribute towards differentiation from the direct object, we must assume that lexical properties of the verb are also important. After all, it is a property of the verb that it takes two objects. We find that a set of verbs are recurrent in the correctly analyzed sentences, most notably the verbs ge ‘give’, lära ‘teach’, skaffa ‘obtain’, fråga ‘ask’, tänka ‘think’ which alone account for around 70% of the sentences. There are on average 5 indirect objects per verb type and only 8.9% of the verbs are hapax legomena.

The error types depicted in table 8.17 indicate that the main confusion for indirect objects is with the direct object function. The other prominent error type is confusion with the determiner function. As mentioned earlier, confusion with the direct object is not surprising due to the fact that these may occur in the same position.
Confusion with the determiner function results from formal ambiguity – both pronouns and nouns may function as determiners. Since the following direct object is often a noun, the indirect object is analyzed as a determiner.

If we examine the errors performed by the parser along the same parameters as above – part-of-speech, word order and verbal properties – we find that the errors contain a lower proportion of pronominal indirect objects (61%), hence more nouns. Since indirect objects typically are pronominal and direct objects nouns, this is one property which contributes to the confusion by the parser. Also, there is a somewhat lower proportion of the unmarked word order pattern (70.8%), compared to the set of correctly analyzed indirect objects (87.5%). With a different ordering of arguments, e.g. postverbal subject, the confusion possibilities obviously multiply. The biggest difference from the set of correctly analyzed indirect objects can be found in the lexical heterogeneity of the verbal head. The same set of five ditransitive verbs which accounted for 70% of the correct sentences here only account for 33% of the sentences and the number of indirect object instances per verb type is 1.6. Half of all the verbs are hapax legomena, indicating that they have never been observed prior to parsing.

The above analysis shows that in the analysis of indirect objects, the main difficulty is in distinguishing them from the direct object. This relies on the interplay of several factors. An unmarked word order, differences in nominal realization as well as an acquired generalization over the verb and its ditransitive argument structure are all factors which contribute to argument disambiguation.

8.3.5 Subject predicative errors

Table 8.18 shows the most frequent errors for the dependency relation of subject predicative (SP). We find that the most common error consists in the confusion of subject predicatives for subjects.
This is not surprising, as these usually accompany each other and may be realized by the same types of constituents.¹⁶⁷ In parallel, we saw earlier that subjects are often confused for subject predicatives.

Table 8.18: Error types for subject predicatives (SP).

Gold  System    #
SP    SS      240
SP    AA      136
SP    ROOT     84
SP    OO       65
SP    OA       28
SP    CC       27
SP    DT       25
SP    AT       23
SP    KA       23
SP    PA       16

A clear majority (80.8%) of the SP_SS errors are nominal, i.e. either pronouns or nouns. More than half of these are in preverbal position, a distribution which clearly deviates from that of the corpus as a whole. We observe that the percentage of indefinite nominals is lower for these errors (53.6% of the nominal SP_SS errors are realized by an indefinite noun), a property which may be ascribed to their preverbal position. Clause-initial position is usually correlated with given information, hence also exhibits a greater tendency for definite expression. This property, however, makes these arguments difficult to parse correctly and, in particular, to distinguish from subjects. Subjects and nominal subject predicatives are notoriously difficult to differentiate, even for humans, as the annotation manual also makes clear (Teleman 1974: p. 59).

The second most common error type in the baseline analysis of subject predicatives concerns, in the vast majority of cases, adjectival predicatives. The confusion of subject predicatives with the regular adverbial function (AA) occurs first and foremost for adjectives and adverbs. These are almost exclusively postverbal (92.6% of the errors).

¹⁶⁷ Subject predicatives are, however, found without a corresponding subject in infinitival clauses.
8.3.6 Argument and non-argument errors

We have in the above sections focused largely on an error analysis of the argument relations and have found that confusion of the various argument relations is a common error. We mentioned initially in section 8.2.3 that confusion of non-argument relations, and in particular adverbials, is also a common error. In section 2.4, we noted that the distinction between arguments and non-arguments has been proposed to be gradient and probabilistic in nature (Manning 2003). The error analysis also shows that this distinction is not always straightforward, and we find error types where arguments are confused for non-arguments and vice versa.

Table 8.19: 15 most frequent argument/non-argument error types, where OO=object, AA=other adverbial, ET=nominal post-modifier, SP=subject predicative, TA=temporal adverbial, OA=object adverbial, SS=subject, AG=passive agent, KA=comparative adverbial.

Gold  System    #
OO    AA      103
ET    OO       91
AA    SP       82
AA    OO       76
TA    OO       76
OA    OO       71
OO    ET       58
OO    OA       57
ET    SS       56
TA    SS       53
SS    AA       45
OA    AG       38
AA    SS       37
SS    ET       36
KA    OO       35

In table 8.19 we find an overview of the 15 most frequent error types involving arguments and non-arguments. We find both error types where arguments are confused for non-arguments, e.g., OO_AA, OO_ET, SS_AA, and a somewhat larger group of error types where non-arguments are confused for arguments, e.g., ET_OO, AA_SP, AA_OO, TA_OO etc. First of all, we may note that these error types are not nearly as common as those involving confusion within the groups of arguments and non-arguments. This indicates that the distinction is acquired to a certain extent. Moreover, we find that the errors further support defining properties of the groups of arguments and non-arguments given the features employed by the parser.
The errors confusing arguments for non-arguments largely involve categorially non-canonical arguments – adjectival and verbal objects and subjects, as in the OO_AA error in (165) below. In a parallel fashion, the non-arguments analyzed as arguments are predominantly nominal, as in the ET_OO and TA_SS errors in (166) and (167) below.

(165) Samma sak  gäller   vuxna
      same  case concerns adults
      ‘The same goes for adults’

(166) Den  rätten    har  vi kvinnor haft sedan . . .
      that right-DEF have we women   had  since . . .
      ‘We, the women, have had that right since . . . ’

(167) Varje morgon  åker   tre   miljoner människor . . .
      every morning travel three million  people    . . .
      ‘Every morning, three million people travel . . . ’

With respect to word order, we find that preverbal, nominal adverbials are erroneously analyzed as subjects and postverbal adverbials as objects, indicating acquisition of word order preferences in line with our earlier findings.

8.3.7 Head distance

The overview of the distribution of the various argument relations in Scandinavian showed that they differ in their ordering preferences with respect to the verb. In the error analysis we have seen clear evidence for the acquisition of these preferences in the fact that the errors are largely instances which depart from the most frequent ordering. For subjects, for instance, we have seen that the set of errors contains notably fewer preverbal elements. Separate from the issue of ordering, however, is the issue of general distance to the head. Given the incremental, deterministic nature of our parser, we may assume that longer dependency arcs will be less accurate and more error-prone (McDonald and Nivre 2007).

In order to evaluate the distance factor with respect to the argument relations, we may compare head distance in the sets of correctly parsed arguments with the corresponding sets of errors.
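A comparison of this kind can be sketched as below. The sketch assumes that head distance is the absolute difference in linear token position between a dependent and its head, and that dependents at distance 1 count as adjacent and those at distance 1 to 3 as close; the function name `distance_profile` and the token positions are hypothetical and invented for illustration.

```python
def distance_profile(arcs, adjacent=1, close=3):
    """Return the proportions of adjacent and of close dependents.

    arcs: list of (dependent_position, head_position) pairs, using linear
    token positions within the sentence.
    """
    if not arcs:
        raise ValueError("arcs must not be empty")
    distances = [abs(dep - head) for dep, head in arcs]
    n = len(distances)
    adjacent_share = sum(d <= adjacent for d in distances) / n
    close_share = sum(d <= close for d in distances) / n
    return adjacent_share, close_share


# Invented toy arcs for one relation: correctly parsed versus erroneous.
correct_arcs = [(1, 2), (3, 2), (4, 2), (1, 2)]
error_arcs = [(6, 2), (3, 2), (5, 2), (7, 2)]

print("correct:", distance_profile(correct_arcs))
print("errors: ", distance_profile(error_arcs))
```

Running the function separately on the correct set and the error set of each relation gives one pair of proportions per relation, which can then be compared directly.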
Figure 12 shows the proportions of adjacent (±1) and close (±1,2,3) dependents with respect to the head in the sets of correct instances as opposed to errors for the subject (SS), formal subject (FS), direct (OO) and indirect (IO) object, as well as subject predicative (SP) dependency relations. First of all, it is clear that the argument relations differ in their preferences with respect to distance to the head, as seen by the general height of the bars. Note, however, that since we measure distance in terms of linear position, distance preferences will also be interfused with preference for short expression. For instance, subjects (SS, FS) and indirect objects (IO) have a preference for pronominal realization and are also shown to be highly local. It is not surprising then, that subjects (SS, FS) and indirect objects (IO) show higher proportions of adjacent dependents than direct objects. More importantly, however, we observe that the sets of correctly and erroneously analyzed arguments clearly differ in proportions of adjacent and close dependents. We find that their proportion is notably higher in the sets of correct instances for all relations except the formal subjects.¹⁶⁸ The difference is most clear in the case of adjacent dependents in the left panel of figure 12, but the same tendency is also present in the right panel, where we define the set of close dependents to be within a linear distance of ±1,2,3 from the head.

Figure 12: Proportion of adjacent (±1) dependents (left) and close (±1,2,3) dependents (right) in correct versus error sets for argument relations. [Bar charts; horizontal axis: SS, FS, OO, IO, SP; vertical axis: proportion from 0 to 1; bars: Correct, Error.]

8.4 Setting the scene

This chapter has dealt with argument disambiguation in Swedish.
We have introduced data-driven dependency parsing and argued that it provides a framework for studying the effect of frequency-derived constraints in argument differentiation. Data-driven parsing has the advantage that syntactic analysis is directly conditioned on properties of the data, and, in particular, on frequency of language use. Dependency analysis provides a framework which maintains a separate level of grammatical functions, which enables the acquisition of linguistic constraints on grammatical functions, rather than structural position. The system employed for data-driven dependency parsing, MaltParser, makes it possible to constrain the information employed during analysis, hence providing an experimental setting where we may study the effect of different factors.

¹⁶⁸ Recall that formal subjects are realized as pronouns and positioned either immediately preceding or following the finite verb.

With the general aim of studying the generalizations which may be acquired during data-driven dependency parsing of Swedish, we have argued that an in-depth error analysis can provide some of the answers. The errors may be grouped into error types by properties like part-of-speech and dependency relation. By comparing tendencies in sets of correctly and erroneously analyzed arguments, we have approached a characterization of argument errors. In particular, we have seen that confusion of argument relations is a frequent type of error, for instance the error types SS_OO, FS_SS, IO_OO discussed above. Based purely on frequency in the training data, a range of generalizations regarding the structural and formal preferences of the argument relations have been shown to be reliably acquired. Some of these acquired generalizations include:

• canonical ordering of arguments, e.g. subjects typically precede the verb and objects follow.
• verb second (V2)
• lexical/formal preferences:
  – case preferences for pronouns, e.g. indirect objects are accusative
  – verbal subcategorization, e.g. ditransitive verbs, existential predicates
• categorial preferences:
  – the core argument relations are nominal
  – tendencies in referentiality, e.g. subjects and indirect objects are also more likely to be pronominal than direct objects
• main distinction between arguments and non-arguments

The error analysis has also made clear that morphology and word order do not provide sufficient evidence for argument disambiguation in all cases. The errors in argument assignment may often be attributed to global or local ambiguities caused by word order variation and lack of morphological marking. We have also seen that degree of head locality is a factor in the remaining errors. Clearly, local properties of the arguments are even more important in an incremental, deterministic setting.

The above error analysis thus sets the scene for further investigations into argument differentiation through a set of experiments where the influence of linguistic features discussed earlier in this thesis may be explicitly evaluated in terms of argument disambiguation in a data-driven dependency parser.

9 PARSING WITH LINGUISTIC FEATURES

Despite the dramatic improvement in accuracy for data-driven parsers in recent years, we still have relatively little knowledge about the exact influence of data-derived features on the parsing accuracy for specific linguistic constructions. There are a number of studies that investigate the influence of different features or representational choices on overall parsing accuracy, within a variety of different frameworks (Bod 1998; Megyesi 2002; Klein and Manning 2003; Bikel 2004; Charniak and Johnson 2005). There are also attempts at a more fine-grained analysis of accuracy, targeting specific linguistic constructions or grammatical functions (Buchholz 2002; Carroll and Briscoe 2002; Kübler and Prokić 2006).
But there are few studies that combine the two perspectives and try to tease apart the influence of different features on the analysis of specific constructions, let alone studies motivated by a thorough linguistic analysis.

In this chapter, we present an in-depth study of the influence of certain linguistic features, such as animacy, definiteness, and finiteness, on the parsing accuracy for argument relations. In chapter 8, we saw that characteristics of argument realization in Scandinavian type languages pose special problems for the identification of argument relations due to limited case marking and ambiguous word order patterns. In the following we will experiment with the addition of morphosyntactic and lexical semantic features that approximate the distinguishing properties of the argument functions discussed in chapter 3. We will isolate features of the arguments and the verbal head, as well as combinations of these, and evaluate their effect on overall parsing results as well as on argument disambiguation specifically.¹⁶⁹ We will address the following questions:

Linguistic features: How will a set of linguistic features expressing inherent properties of arguments affect argument disambiguation? How will linguistic features expressing morphological and semantic properties of the verb affect argument disambiguation?

Parser: To what extent is argument disambiguation dependent on specific properties of the parser?

Scalability: Are the results scalable? May the linguistic features be acquired automatically?

¹⁶⁹ A shorter version of the first experiments is found in Øvrelid and Nivre 2007.

9.1 Linguistic features

Argument differentiation depends on a range of linguistic dimensions, some of which are recurrent in a range of languages and others the result of more language-specific properties of syntax and morphology. In order to test the influence of these features in argument differentiation, we must provide empirical approximations of these dimensions which may be derived from a corpus.

In chapter 3, we examined linguistic dimensions which have been claimed to influence argument differentiation across a range of languages and in different types of linguistic studies. In particular, we examined animacy, definiteness and referentiality in detail. The close correlation between animacy and different distinctions in argumenthood has been the subject of large parts of the previous chapters, hence need not be repeated at length here. In Part II we found that syntactic distribution provided a reliable indicator of animacy. In the present context we may also make use of the systematic influence of animacy on arguments and examine the effect of animacy in argument disambiguation. The dimension of definiteness expresses the extent to which a referent is identifiable and unique. It is thus concerned with the status of the referent, either in the linguistic discourse or in terms of cognitive status. With respect to argument differentiation, definiteness is in particular important in distinguishing the external argument, the subject, from other arguments such as objects and subject predicatives. The dimension of referentiality expresses the way in which reference is determined for a linguistic expression, or rather, the extent to which the determination of reference relies on the linguistic context. A highly referential element may be referred to by solely relying on context and may therefore be referred to with a pronoun, whereas a common noun relies to a greater extent on lexical or denotational semantic knowledge. Referentiality is a factor in argument differentiation and in section 8.3.1 we saw that subjects are more likely to be expressed pronominally than objects. Differentiation within the group of objects, i.e.
between direct and indirect objects, is also influenced by referentiality.

Chapter 4 and section 8.3.1 outlined some important properties of Scandinavian morphosyntax. We have seen that the structural expression of arguments in Scandinavian is characterized by initial variation along with rigid verb placement. Recall that the V2 constraint requires that the finite verb be the second constituent of declarative main clauses and finiteness has been claimed to be a defining property of Scandinavian syntax (Holmberg and Platzack 1995; Eide 2008). Even if the morphological marking of arguments in Scandinavian is not extensive or unambiguous, case may distinguish arguments when expressed pronominally.

9.1.1 Empirical approximations

In table 9.1 we find an overview of the linguistic dimensions discussed above with their corresponding treebank feature. It distinguishes between the features discussed in chapter 3, representing soft, cross-linguistic tendencies in argument differentiation, and the more language-specific features of Scandinavian discussed in chapter 4. We map the linguistic features to a set of empirical features representing information which is found in the annotation of the Talbanken05 treebank.

Table 9.1: Linguistic features and their empirical counterparts.

Linguistic feature   Treebank feature
animacy              person reference
definiteness         morph. definiteness
referentiality       pronoun type, part-of-speech
finiteness           tense
case                 morph. case

Recall that the Talbanken05 treebank explicitly distinguishes between person- and non-person referring nominal elements, a distinction which overlaps fairly well with the traditional notion of animacy. See section 7.1.2 for a detailed overview of the information on person reference in Talbanken05. Morphological definiteness is marked for all common nouns in Talbanken05; definite nouns are marked as DD and indefinite nouns are unmarked (Ø).
Talbanken05 contains morphological case annotation for pronouns which distinguishes between nominative (Ø) and accusative (AA) case. Common nouns distinguish nominative (Ø) and genitive (GG) case. The morphosyntactic features which are expressed for the part-of-speech of verb in Talbanken are tense (present, past, imperative, past/present subjunctive, infinitive and supine) and voice (Ø/passive; PA).

Pronouns are furthermore annotated with a set of pronominal classes which distinguish between e.g. 1st/2nd person and 3rd person pronouns, reflexive, reciprocal, interrogative, impersonal pronouns etc. For the third person neuter pronoun det ‘it’ and demonstrative detta ‘this’, the annotation in Talbanken05 distinguishes between an impersonal and a personal or “definite” (DP) usage. The impersonal class includes expletives, as well as pronouns which refer to a preceding clause, as in (168) below.

(168) Det  skall Bayless hålla reda  på
      that shall Bayless hold  order on
      ‘Bayless will keep track of that’

The impersonal pronominal class is employed for non-referential pronouns.¹⁷⁰ The two classes of pronouns have quite distinct syntactic behaviours. The impersonal pronouns never function as determiners (DT), whereas the definite pronouns often do (71.4%). Also, the impersonal pronouns are more likely to function as formal subjects (FS) (32.4%) than the definite pronouns (1.1%).¹⁷¹

9.2 Experiments with linguistic features

The experiments are aimed at investigating argument differentiation in a disambiguation task where frequency drives analysis. There is no explicit formulation of constraints; rather, syntactic analysis is constrained directly by frequency of language use.

In chapter 8 we found that a set of features expressing only word form, part-of-speech and preceding analysis served to guide the acquisition of general patterns of argument realization discussed in chapter 4.
However, the observed errors in argument assignment were caused in part by ambiguities precisely in lexical form, morphological marking and word order patterns. We will here examine the effect of additional differentiating properties of arguments.¹⁷²

¹⁷⁰ Note that we here employ ‘referential’ in a narrow sense, which only includes reference to entities. The category of ‘non-referential pronouns’ consequently includes pronouns which do not refer, i.e., expletives, as well as pronouns which refer to propositions.

¹⁷¹ The fact that a few definite/referential pronouns are annotated as formal subjects (15 instances, 1.1%) and that a few impersonal pronouns function as determiners (11 instances, 0.4%) must be assumed to stem from annotation error.

¹⁷² All examples in the current chapter are taken from the written sections of Talbanken05.

9.2.1 Experimental methodology

Figure 13: Extended (FEATS) feature model for Swedish; S: stack, I: input, G: graph; ±n = n positions to the left (−) or right (+).

FORM POS DEP FEATS
S:top + + + +
S:top+1 +
I:next + + +
I:next−1 + +
I:next+1 + + +
I:next+2 +
G: head of top + +
G: leftmost dependent of top +
G: rightmost dependent of top +
G: leftmost dependent of next + + +
G: leftmost dependent of head of top +
G: leftmost sibling of rightmost dependent of top +
G: rightmost sibling of leftmost dependent of top + +
G: rightmost sibling of leftmost dependent of next + +

The main goal of the experiments is to evaluate the effect of the linguistic features discussed in section 9.1 on argument disambiguation. The experimental setup should therefore enable us to isolate the effects of different features, as well as evaluate and compare them. We take the parser evaluated in chapter 8 as our baseline system against which we compare and quantify improvement/deterioration in results.
All experiments are performed using 10-fold cross-validation for training and testing on the entire written part of Talbanken05.

Recall from section 8.1.3 that the feature model of MaltParser defines the attributes employed to describe the parse configurations at each point during parsing. In order to incorporate information on our linguistic features we therefore extend the feature model with an additional, static attribute, FEATS. The extended version of the feature model is depicted in figure 13, including all four columns.¹⁷³ What is varied in the experiments is thus only the information contained in the FEATS features (animacy, definiteness, etc.), while the tokens for which these features are defined remain constant. This provides a controlled setting for the testing of our linguistic features. Note that the FEATS features added in this way, like the POS features inherited from the baseline parser, are initially taken from the gold standard annotation in the treebank, which means that the results may give an over-optimistic view of the accuracy that can be expected when parsing new text. We will return to this point later in this chapter.

¹⁷³ Preliminary experiments showed that it was better to tie FEATS features to the same tokens as FORM features (rather than POS or DEP features). Backward selection from this model was tried for several different instantiations of FEATS but with no significant improvement.

In terms of evaluation we wish to be able to quantify the effect of the added features, as well as perform more in-depth error analyses. Evaluation will be performed at the different levels of analysis established in section 8.2.1 – overall accuracy employing labeled/unlabeled attachment scores, performance per dependency relation, as well as an overview in terms of error types.
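The idea that only the content of FEATS varies between experiments, while the tokens stay fixed, can be illustrated with a small sketch. It assumes, purely for illustration, that the treebank is stored in a tab-separated, ten-column CoNLL-X-style format with FEATS as the sixth column. The helper name `restrict_feats` and the feature labels `DD` and `HH` are hypothetical; `GG` and `AA` are the genitive and accusative tags mentioned above.

```python
def restrict_feats(conll_line, keep):
    """Keep only the FEATS values listed in `keep`.

    Assumes a ten-column, tab-separated line with FEATS in column 6
    (index 5); this file layout is an assumption, not taken from the text.
    """
    columns = conll_line.rstrip("\n").split("\t")
    if len(columns) != 10:
        # Blank lines (sentence boundaries) and unexpected lines pass through.
        return conll_line
    feats = [] if columns[5] == "_" else columns[5].split("|")
    kept = [value for value in feats if value in keep]
    columns[5] = "|".join(kept) if kept else "_"
    return "\t".join(columns)


# Hypothetical token: a definite (DD), genitive (GG) noun used as determiner.
token = "1\tdagens\t_\tNN\tNN\tDD|GG\t2\tDT\t_\t_"

case_only = restrict_feats(token, keep={"GG", "AA"})  # a Case-style run
no_feats = restrict_feats(token, keep=set())          # a NoFeats-style run
```

Producing one such filtered copy of the data per experiment would leave word forms, part-of-speech tags and gold dependencies untouched, so that differences in results can be attributed to the FEATS content alone.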
Statistical significance is checked using Dan Bikel’s randomized parsing evaluation comparator.¹⁷⁴ ¹⁷⁵ Since the main focus is on argument analysis, significance testing and error analysis will focus on labeled accuracy, unless otherwise stated. We report accuracy for specific dependency relations, measured as a balanced F-score. In order to summarize improvement with respect to dependency relation assignment when comparing two parsers, we rank the relations by their frequency-weighted difference of F-scores.¹⁷⁶

In the error analysis in chapter 8, we examined various error types in terms of confusion classes of dependency relations. We will employ two different comparative measures to compare parsers with respect to specific error types: (i) the difference in total number of errors of a certain type for the compared parsers, and (ii) the number of corrected or newly added errors. In set-theoretic terms, with respect to the set of errors for a baseline parser $P_{Bl}$ and a new parser $P_N$, the two measures are defined as follows:

(i) $|P_{Bl}| - |P_N|$
(ii) $|P_{Bl} - P_N|$ or $|P_N - P_{Bl}|$

Whereas the former measure compares the overall tendency of a parser to make a certain type of error, the latter allows us to compare the parsers’ performance for the same set of errors that were examined during the initial error analysis and specifically targeted in the experiments. Examining sets of corrected or newly added errors provides us with more detailed information regarding the effect of the added information.
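The evaluation measures described above can be sketched in a few lines. The function names are hypothetical, the three relation rows are taken from the animacy experiment in table 9.3, and the token identifiers in the error sets are invented for illustration.

```python
def balanced_f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank_by_weighted_gain(relations):
    """Rank relations by relative frequency times F-score difference.

    `relations` maps a label to (relative_frequency, f_baseline, f_new).
    """
    gains = {
        label: freq * (f_new - f_base)
        for label, (freq, f_base, f_new) in relations.items()
    }
    return sorted(gains, key=gains.get, reverse=True)


def compare_error_sets(baseline_errors, new_errors):
    """Measures (i) and (ii) over sets of erroneous token identifiers."""
    return {
        "difference_in_totals": len(baseline_errors) - len(new_errors),  # (i)
        "corrected": baseline_errors - new_errors,                       # (ii)
        "newly_added": new_errors - baseline_errors,                     # (ii)
    }


# Three rows from table 9.3 (animacy): (Freq, NoFeats, Anim).
animacy = {
    "FO": (0.0009, 56.68, 63.81),
    "SS": (0.1105, 90.25, 90.81),
    "TA": (0.0249, 70.29, 71.07),
}
ranking = rank_by_weighted_gain(animacy)

# Invented token identifiers for one error type.
result = compare_error_sets({1, 2, 3, 4}, {3, 4, 5})
```

The weighting explains why a rare relation such as FO can gain several F-score points and still rank below frequent relations such as SS. The two error measures can also diverge: a parser may correct many baseline errors while introducing new ones, which the difference in totals alone would hide.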
Total numbers of errors are presented in the form of confusion matrices, see, e.g., table 9.4, where we give the total number of occurrences of each error type for the baseline parser, together with the percentage of each error type out of all errors for the dependency relation. For the extended parsers, we give total numbers (#) along with the relative improvement compared to the baseline (%).

¹⁷⁴ http://www.cis.upenn.edu/~dbikel/software.html
¹⁷⁵ The main idea in randomized parsing evaluation is that, given a null hypothesis of no difference between two sets of results, shuffling the results from one system with those of the other should often produce a difference in overall results equal to or greater than the original difference, since the individual scores should then be equally likely. If the performance of the two systems differs significantly, on the other hand, shuffling the predictions will very infrequently lead to a larger performance difference. The shuffling is iterated 10,000 times and the total number of differences in results equal to or larger than the original is recorded. The relative frequency of such differences is then interpreted as the significance of the difference.
¹⁷⁶ For each dependency relation, the difference in F-scores is weighted by the relative frequency of the dependency relation, $\frac{Deprel}{\sum_i Deprel_i}$, in the treebank.

          Unlabeled  Labeled
NoFeats   89.87      84.92
Anim      89.93      85.10
Def       89.87      85.02
Pro       89.91      85.04
Case      89.99      85.13
Verb      90.15      85.28
ADPC      90.17      85.45
ADPCV     90.42      85.73
All       90.73      86.32

Table 9.2: Overall results expressed as unlabeled and labeled attachment scores.

9.2.2 Animacy

As table 9.2 shows, the addition of information on animacy for nominal elements causes an improvement in overall results (p<.0002).
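Significance levels such as the one just cited come from the randomized comparison described in section 9.2.1. The following is a minimal sketch of that kind of test, not the comparator's actual implementation. It assumes one score per sentence for each system, and the add-one smoothing of the final ratio is a common convention adopted here as an assumption; the function name is hypothetical.

```python
import random


def randomized_comparison(scores_a, scores_b, iterations=10_000, seed=0):
    """Approximate randomization test on paired per-sentence scores.

    Returns the smoothed relative frequency of shuffles whose absolute
    difference in total score is at least as large as the observed one.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    at_least_as_large = 0
    for _ in range(iterations):
        difference = 0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two labels are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            difference += a - b
        if abs(difference) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (iterations + 1)


# Identical systems: every shuffle matches the observed difference of zero.
p_same = randomized_comparison([1, 0, 1, 1], [1, 0, 1, 1], iterations=200)

# One system is better on every one of 50 sentences.
p_diff = randomized_comparison([1] * 50, [0] * 50, iterations=1000)
```

A small returned value indicates that random relabeling rarely reproduces a difference as large as the observed one, which is the sense in which the reported p-values should be read.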
We find that the added information has the greatest effect on the labeling of dependency results, rather than attachment.¹⁷⁷ The improvement may be summarized per dependency relation as in table 9.3, where the dependency relations are ranked by their frequency-weighted difference of balanced F-scores in order to indicate their relative impact in the improvement of the parse results.

The subject and object functions are the dependency relations whose assignment improves the most when animacy information is added. We also find a small improvement for indirect objects: F-scores improve from 77.2 to 78.8.¹⁷⁸ Furthermore, there is an effect in accuracy for a range of other functions where animacy is not directly relevant, but where the improved analysis of arguments contributes towards correct identification, e.g., adverbials and determiners.

¹⁷⁷ The difference in unlabeled results is not statistically significant.
¹⁷⁸ Since indirect objects are quite infrequent in the treebank, this dependency relation is not included in the ranked list in table 9.3.

     Freq    NoFeats  Anim          Freq    NoFeats  Def
SS   0.1105  90.25    90.81    SP   0.0297  84.82    85.59
OO   0.0632  84.53    85.04    OO   0.0632  84.53    84.84
DT   0.1081  94.14    94.48    DT   0.1081  94.14    94.30
TA   0.0249  70.29    71.07    SS   0.1105  90.25    90.38
PA   0.1043  94.69    94.81    AA   0.0537  68.70    68.91
CC   0.0343  78.02    78.34    PA   0.1043  94.69    94.77
++   0.0422  90.33    90.52    TA   0.0249  70.29    70.57
OA   0.0305  70.63    70.84    AN   0.0057  39.42    40.64
FO   0.0009  56.68    63.81    +F   0.0099  52.07    52.64
AT   0.0441  95.76    95.90    UK   0.0305  93.17    93.30

Table 9.3: 10 most improved dependency relations with added information on animacy (left) and definiteness (right), ranked by their weighted difference of balanced F-scores.

If we take a closer look at the individual error types involving subjects and objects in table 9.4, we find that the addition causes a reduction of errors confusing subjects with objects (SS_OO), determiners (SS_DT) and subject predicatives (SS_SP) – all functions which do not exhibit the same preference for human reference as subjects. For the SS_SS error type, we do not find an improvement.¹⁷⁹

The set of corrected errors shows an effect of acquired animacy preferences. For instance, all corrected indirect objects (20 instances, 13.5% of the baseline errors) are human. The added information corrects 15.6% of the baseline errors for the SS relation and 60% of these refer to humans. The influence of the animacy information is clear also if we examine the individual error types. For subjects, we find an improved disambiguation from non-argument relations such as adverbials and determiners, as mentioned above. For instance, the corrected errors of the SS_AA error type, which account for 28.9% of the baseline errors, are all human, and the corrected errors for the SS_DT error type (21.4%) show a clear majority of 84.3% human elements. With respect to the confusion of argument relations, we also observe the influence of the added information in the error sets. The percentages of corrected errors for the SS_OO and OO_SS error types, compared to the baseline parser, are 21.1% and 28.5%, respectively, when adding information on animacy. Whereas 57.4% of the corrected subjects of this error type are human, only 19.3% of the corrected objects are.

¹⁷⁹ Labeled attachment scores require both attachment and labeling to be correct, hence we find error types like SS_SS, where only the head attachment is incorrect.

Confusion matrix for subjects (SS)
sys   NoFeats     Anim       Def        Pro        Case       Verb       ADPC       ADPCV      All
        # % tot.    #     %    #     %    #     %    #     %    #     %    #     %    #     %    #     %
OO    446   23.0  388  13.0  425   4.7  401  10.1  419   6.1  365  18.2  361  19.1  293  34.3  296  33.6
ROOT  265   13.7  270  -1.9  284  -7.2  275  -3.8  277  -4.5  260   1.9  269  -1.5  266  -0.4  241   9.1
DT    238   12.3  196  17.6  230   3.4  218   8.4  205  13.9  239  -0.4  164  31.1  160  32.8  160  32.8
SS    216   11.1  222  -2.8  214   0.9  202   6.5  198   8.3  161  25.5  217  -0.5  166  23.1  153  29.2
SP    206   10.6  203   1.5  187   9.2  198   3.9  201   2.4  216  -4.9  188   8.7  187   9.2  195   5.3
CC    137    7.1  135   1.5  123  10.2  139  -1.5  139  -1.5  122  10.9  120  12.4  114  16.8   98  28.5
FS    133    6.9  141  -6.0  148 -11.3  148 -11.3  154 -15.8  151 -13.5  147 -10.5  153 -15.0  155 -16.5
PA     53    2.7   53   0.0   43  18.9   43  18.9   37  30.2   49   7.5   25  52.8   22  58.5   26  50.9
...

Confusion matrix for objects (OO)
sys   NoFeats     Anim       Def        Pro        Case       Verb       ADPC       ADPCV      All
        # % tot.    #     %    #     %    #     %    #     %    #     %    #     %    #     %    #     %
SS    309   21.3  263  14.9  288   6.8  280   9.4  273  11.7  259  16.2  251  18.8  215  30.4  212  31.4
ROOT  221   15.2  239  -8.1  224  -1.4  237  -7.2  229  -3.6  218   1.4  251 -13.6  245 -10.9  241  -9.0
OO    149   10.3  153  -2.7  151  -1.3  148   0.7  146   2.0  143   4.0  143   4.0  141   5.4  141   5.4
PA    126    8.7  122   3.2  129  -2.4  123   2.4  112  11.1  123   2.4  111  11.9  109  13.5  105  16.7
AA    103    7.1   94   8.7   97   5.8   92  10.7  106  -2.9  102   1.0   96   6.8   95   7.8   74  28.2
DT     99    6.8   95   4.0   94   5.1   99   0.0   85  14.1   99   0.0   81  18.2   70  29.3   72  27.3
ET     58    4.0   54   6.9   61  -5.2   57   1.7   59  -1.7   64 -10.3   49  15.5   49  15.5   49  15.5
OA     57    3.9   59  -3.5   58  -1.8   58  -1.8   57   0.0   65 -14.0   63 -10.5   66 -15.8   64 -12.3
...

Table 9.4: Confusion matrices for the assignment of the subject and object dependency relations for the baseline parser (columns 2–3) and for the extended feature models (columns 4–11).

9.2.3 Definiteness

The addition of information on definiteness during parsing causes a significant improvement of overall results (p<.02).
The dependency relation for which we observe the largest improvement is the subject predicative relation (SP), as shown in table 9.3. As we recall from section 8.3, subject predicatives are often confused with subjects, see table 9.4, and vice versa, see table 9.5. Predicatives in Swedish usually stand in a classifying relation to the subject, where the subject is established as being an instance of a class of some kind. As a consequence, the predicative is often denoted by a nominal expressing generic reference and typically realized by an indefinite noun phrase. If we examine the set of corrected errors compared to the baseline, we find that the added information causes a 14.2% reduction of the SP_SS errors, all of which are indefinite nouns.

We furthermore observe an improved performance in the analysis of indirect objects (IO) with an F-score improvement from 77.2 to 78.7, see the confusion matrix for this dependency relation in table 9.6. As mentioned in chapter 3, the two objects in a double object construction are typically differentiated by several factors, among which definiteness has been shown to be one (Bresnan et al. 2005).

Confusion matrix for subject predicatives (SP)
sys   NoFeats     Anim       Def        Pro        Case       Verb       ADPC       ADPCV      All
        # % tot.    #     %    #     %    #     %    #     %    #     %    #     %    #     %    #     %
SS    240   30.0  231   3.8  229   4.6  237   1.2  231   3.8  240   0.0  213  11.2  213  11.2  208  13.3
AA    136   17.0  140  -2.9  123   9.6  126   7.4  127   6.6  131   3.7  127   6.6  129   5.1  131   3.7
ROOT   84   10.5   86  -2.4   83   1.2   85  -1.2   80   4.8   86  -2.4   82   2.4   77   8.3   76   9.5
OO     65    8.1   71  -9.2   63   3.1   73 -12.3   66  -1.5   72 -10.8   70  -7.7   64   1.5   67  -3.1
SP     34    4.2   35  -2.9   32   5.9   34   0.0   35  -2.9   30  11.8   35  -2.9   32   5.9   30  11.8
OA     28    3.5   25  10.7   25  10.7   25  10.7   27   3.6   27   3.6   28   0.0   26   7.1   31 -10.7
CC     27    3.4   25   7.4   29  -7.4   27   0.0   27   0.0   25   7.4   27   0.0   25   7.4   11  59.3
DT     25    3.1   29 -16.0   24   4.0   22  12.0   25   0.0   28 -12.0   20  20.0   22  12.0   21  16.0
AT     23    2.9   22   4.3   20  13.0   24  -4.3   22   4.3   24  -4.3   20  13.0   22   4.3   22   4.3
KA     23    2.9   17  26.1   22   4.3   21   8.7   26 -13.0   21   8.7   21   8.7   21   8.7   19  17.4
PA     16    2.0   16   0.0   14  12.5   16   0.0    9  43.8   17  -6.2    8  50.0    7  56.2    9  43.8
...

Table 9.5: Confusion matrix for the assignment of the subject predicative dependency relation for the baseline parser (columns 2–3) and for the extended feature models (columns 4–11).

Confusion matrix for indirect objects (IO)
sys   NoFeats     Anim       Def        Pro        Case       Verb       ADPC       ADPCV      All
        # % tot.    #     %    #     %    #     %    #     %    #     %    #     %    #     %    #     %
OO     97   65.5   95   2.1   85  12.4   94   3.1   94   3.1   97   0.0   89   8.2   96   1.0   97   0.0
DT     31   20.9   31   0.0   33  -6.5   33  -6.5   23  25.8   33  -6.5   21  32.3   25  19.4   19  38.7
PA      6    4.1    5  16.7    5  16.7    5  16.7    6   0.0    5  16.7    5  16.7    5  16.7    5  16.7
IO      4    2.7    4   0.0    4   0.0    3  25.0    4   0.0    4   0.0    4   0.0    4   0.0    4   0.0
ROOT    3    2.0    0   0.0    2  33.3    1  66.7    3   0.0    3   0.0    1  66.7    1  66.7    0   0.0
AT      2    1.4    1  50.0    2   0.0    2   0.0    2   0.0    2   0.0    1  50.0    1  50.0    1  50.0
SS      2    1.4    1  50.0    3 -50.0    3 -50.0    2   0.0    3 -50.0    2   0.0    3 -50.0    3 -50.0
AA      1    0.7    0   0.0    1   0.0    1   0.0    1   0.0    1   0.0    0   0.0    1   0.0    0   0.0
HD      1    0.7    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0
SP      1    0.7    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0

Table 9.6: Confusion matrix for the assignment of the indirect object dependency relation for the baseline parser (columns 2–3) and for the extended feature models (columns 4–11); shows all errors.

9.2.4 Pronoun type

The addition of pronoun type information causes a general improvement in overall parsing results (p<.01), as we can see from table 9.2.
The dependency relations whose assignment improves the most are, once again, the core argument functions (SS, OO), see table 9.7. We also find a general improvement in terms of recall for the assignment of the formal subject (FS) and object (FO) functions, which are both realized by the third person neuter pronoun det ‘it’, annotated as non-referential in the treebank.

     Freq    NoFeats  Pro           Freq    NoFeats  NonRef
SS   0.1105  90.25    90.66    SS   0.1105  90.25    90.61
OO   0.0632  84.53    84.99    OO   0.0632  84.53    84.96
FS   0.0050  71.31    73.99    TA   0.0249  70.29    71.02
PA   0.1043  94.69    94.78    FS   0.0050  71.31    74.22
FO   0.0009  56.68    66.18    UK   0.0305  93.17    93.46
SP   0.0297  84.82    85.08    AN   0.0057  39.42    40.53
AA   0.0537  68.70    68.84    ES   0.0050  71.82    72.80
TA   0.0249  70.29    70.59    MA   0.0091  76.14    76.57
+F   0.0099  52.07    52.80    ROOT 0.0649  86.71    86.77
UK   0.0305  93.17    93.40    FO   0.0009  56.68    60.32

Table 9.7: 10 most improved dependency relations with added information on pronominal class (left) and non-referentiality (right), ranked by their weighted difference of balanced F-scores.

9.2.4.1 Non-referential pronouns

In a separate experiment (NonRef), we isolated the property of non-referentiality from the other pronominal classes. In this experiment, only information regarding non-referentiality was included as an additional feature during parsing. The results were slightly lower than in the experiment with all information on pronominal class, but not significantly so, and these results also differ significantly from those of the baseline parser (p<.01).

We would expect information on non-referentiality to be beneficial in the disambiguation of regular, referential subjects (SS) and formal subjects (FS). The error analysis in 8.3 showed that these are difficult to distinguish by form or word order alone and we found that verbal form was the main indicator for the baseline parser.
Moreover, and as a consequence of an improved analysis of formal and regular subjects, we may expect improvement in the disambiguation of logical subjects (ES) and direct objects (OO).

Confusion matrix for formal subjects (FS)
sys   NoFeats     Anim       Def        NonRef     Case       Verb       ADPC       ADPCV      All
        # % tot.    #     %    #     %    #     %    #     %    #     %    #     %    #     %    #     %
SS    281   91.8  272   3.2  277   1.4  247  12.1  277   1.4  275   2.1  241  14.2  244  13.2  252  10.3
DT     10    3.3    7  30.0   12 -20.0    5  50.0    9  10.0    8  20.0    7  30.0    7  30.0    3  70.0
OO      6    2.0    5  16.7    7 -16.7    5  16.7    5  16.7    3  50.0    5  16.7    3  50.0    5  16.7
FS      4    1.3    7 -75.0    4   0.0    3  25.0    5 -25.0    3  25.0    3  25.0    4   0.0    3  25.0
ROOT    3    1.0    5 -66.7    4 -33.3    2  33.3    4 -33.3    3   0.0    3   0.0    2  33.3    2  33.3
KA      1    0.3    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0    1   0.0
FO      1    0.3    1   0.0    1   0.0    1   0.0    1   0.0    0   0.0    1   0.0    1   0.0    0   0.0

Table 9.8: Confusion matrix for the assignment of the formal subject dependency relation for the baseline parser (columns 2–3) and for the extended feature models (columns 4–11); shows all errors.

If we examine table 9.7, we find that isolation of non-referentiality has a clear effect on the analysis of the SS, OO and FS relations. In fact, performance for the FS relation is slightly better in the NonRef experiment than in the experiment where all pronominal features were included (Pro). The identification of ES also improves. Note, however, that there is no direct mapping between non-referentiality and status as a formal subject. As mentioned earlier in section 8.3.3, non-referential pronouns in Talbanken05 are annotated as subjects when they are not linked to another argument, i.e., the logical subject. In fact, the most common dependency relation for the non-referential pronouns is SS (48.7%) and not FS (31.4%). Even so, it is clear that the added information contributes towards an improved recognition of the formal subject relation.
Table 9.8 shows a confusion matrix for the FS relation, where the results for the NonRef feature are displayed in column 6. We find a reduction in the total number of errors of 12.1% for the FS_SS error type, compared to 10.7% with all pronominal features, and the corrected FS_SS errors are all non-referential. So, even though non-referentiality does not constitute unequivocal evidence for a formal subject analysis, it contributes important information along with the other available features, such as verb form. A clear majority (70%) of the corrected SS_FS errors are referential, as in example (169) below, where det ‘it’ refers to a narcotic substance:

(169) Det visade sig    vara vanebildande
      it  showed itself be   addictive
      ‘It turned out to be addictive’

9.2.5 Case

     Freq    NoFeats  Case          Freq    NoFeats  Verb
DT   0.1081  94.14    94.71    SS   0.1105  90.25    90.88
PA   0.1043  94.69    95.13    VG   0.0302  94.65    96.61
SS   0.1105  90.25    90.61    ROOT 0.0649  86.71    87.61
OO   0.0632  84.53    85.11    OO   0.0632  84.53    85.45
TA   0.0249  70.29    71.16    +F   0.0099  52.07    55.08
SP   0.0297  84.82    85.20    MS   0.0096  63.35    66.43
+F   0.0099  52.07    52.70    UK   0.0305  93.17    93.70
AN   0.0057  39.42    40.35    ++   0.0422  90.33    90.67
VG   0.0302  94.65    94.81    AG   0.0019  73.56    80.64
IO   0.0024  76.14    77.88    AN   0.0057  39.42    41.69

Table 9.9: 10 most improved dependency relations with added information on case (left) and verb (right), ranked by their weighted difference of balanced F-scores.

When we employ case information during parsing we find a clear improvement in results (p<.0001). However, the improvement is not first and foremost caused by improvement in assignment of subjects and objects, but rather in the assignment of determiners and prepositional complements, see table 9.9. The error analysis in section 8.3 showed evidence that pronominal case preferences were acquired through lexical form.
However, the error analysis also noted ambiguities between status as phrasal modifier as opposed to phrasal head, e.g. SS_DT, OO_PA. Nouns in Swedish are inflected for genitive case and may then serve as determiners for other nouns. Knowledge that a noun is in genitive case in theory excludes it from having an argument relation, be it a clausal argument relation like subject and object, or a phrasal argument relation, such as prepositional complement (PA).

We find a clear effect of genitive case marking in the improved results. For the determiner relation we find improvements in total number of errors for error types indicating confusion with a range of clausal and phrasal argument relations, e.g. DT_PA (21.1%), DT_SS (17.1%). Clear majorities of the corrected errors are in genitive case; for instance, 79.3% of the DT_PA errors headed by a noun are in genitive case. We also observe an improvement in the total error counts of 25.5% for the converse PA_DT error type and find that all the corrected PA_DT errors are non-genitive, as in (170):

(170) I  dagens äktenskap accepterar de  flesta kvinnor inte utan    protest
      in todays marriage  accept     the most   women   not  without protest
      rollen   som en undergiven  maka
      role-DEF as  a  subordinate spouse
      ‘In modern marriages most women do not accept a role as a subordinate spouse without protest’

We may summarize, then, that case information has an effect mostly on ambiguities between phrasal head relations, e.g., PA, SS, OO, and modifier relations, e.g., DT, with positive effects for the analysis of both types of dependency relations.

9.2.6 Verbal features

In this experiment, all information available for the verbal category (Verb), i.e. voice and tense, was included during parsing. The addition of morphosyntactic information for verbs causes a clear improvement in overall results (p<.0001), shown in table 9.2.
Table 9.9 shows the top ten improved dependency relations with added morphosyntactic information for verbs. The added information has a positive effect on the verbal dependency relations – ROOT, MS, VG, as well as an overall effect on the assignment of the SS and OO argument relations. Information on voice also benefits the relation expressing the demoted agent (AG) in passive constructions, headed by the preposition av ‘by’, as in English.

The overview of the most common error types for the SS and OO relations, see confusion matrices in table 9.4, indicates that the addition of information on verbal features improves on the confusion of the main argument types – SS_OO, OO_SS, as well as SS_FS. We also find that head attachment of subjects (SS_SS) in particular improves. We know that the subject is always attached to the finite verb in the Talbanken05 analysis, so this is not surprising.

If we examine the set of baseline errors for the SS_OO and OO_SS error types, we find that 33.2% (SS_OO) and 37.2% (OO_SS) of these have been corrected with added verbal features. For the SS_OO errors, the corrected cases are almost exclusively postverbal subjects which all follow a finite head verb, as in (152) above and repeated here as (171). The OO_SS errors are also almost exclusively postverbal and a fair number of these (37%) have a non-finite head verb, as in (160), repeated here as (172). As we remember, only objects may follow a non-finite verb. Also, among the corrected objects with a finite head, we find quite a few imperative forms, as in (159), repeated as (173).

(171) Samma erfarenhet gjorde engelsmännen
      same  experience made   englishmen-DEF
      ‘The same experience, the Englishmen had’

(172) ... om man tidigare varit gul    i  ögonen,   haft gulsot   eller ...
      ... if one earlier  been  yellow in eyes-DEF, had  jaundice or    ...
      ‘... if one earlier has had yellow eyes, had jaundice or ...’

(173) Glöm   aldrig det  löfte   om trohet       för livet
      forget never  that promise of faithfulness for life-DEF
      ‘Never forget that promise of faithfulness for life’

The verbal properties, then, are beneficial for the disambiguation of the core argument functions, in addition to aiding the correct identification of several verbal dependency relations.

9.2.6.1 Individual verbal features

In order to tease apart the influence of the various verbal features we also performed a set of experiments testing individual sets of verbal features. Three experiments were run with differing feature sets: only voice information (Voice), only tense information (Tense) and a final experiment where the categories in the tense feature were mapped to a binary distinction between finite and non-finite verb forms (Finite). The last experiment was performed in order to test explicitly for the effect of the finiteness of the verb.

          Unlabeled  Labeled
NoFeats   89.87      84.92
Verb      90.15      85.28
Voice     89.81      84.97
Tense     90.15      85.27
Finite    90.24      85.33

Table 9.10: Overall results for experiments with verbal features, expressed as unlabeled and labeled attachment scores.

Voice

The addition of information on voice in isolation has little effect on the results and the overall difference from the baseline is not statistically significant. This is somewhat surprising as voice alternations have such confounding effects on the argument structure and argument realization of a verb.
In Swedish, the passive may be expressed by a passive suffix -s on the verb, as in (174) below, or periphrastically (be/become + passive participle), as in (175):

(174) I  krig utförs       all verksamhet i  ‘skarpladdad miljö’
      in war  perform-PASS all business   in ‘sharploaded environment’
      ‘During war all business is performed in a heavily loaded environment’

(175) ... man kan bli    vald    i  en annan valkrets
      ... one can become elected in a  other constituency
      ‘One may be elected in another constituency’

If we examine only the verbal, passive predicates, we find that passive predicates account for only 9.8% of all verbal predicates in Talbanken.¹⁸⁰ The passive suffix is by far the most common mode of expression for passive voice in Swedish and these account for 78.4% of the passive predicates in the treebank.

     Freq    NoFeats  Voice         Freq    NoFeats  Finite
OO   0.0632  84.53    85.29    ROOT 0.0649  86.71    88.03
SS   0.1105  90.25    90.40    SS   0.1105  90.25    90.91
SP   0.0297  84.82    85.32    VG   0.0302  94.65    96.42
TA   0.0249  70.29    70.85    OO   0.0632  84.53    85.31
AG   0.0019  73.56    80.00    +F   0.0099  52.07    55.45
ES   0.0050  71.82    72.96    MS   0.0096  63.35    66.63
AN   0.0057  39.42    40.26    TA   0.0249  70.29    71.20
UK   0.0305  93.17    93.32    AA   0.0537  68.70    69.04
CA   0.0073  67.66    68.08    ++   0.0422  90.33    90.67
FS   0.0050  71.31    71.69    NA   0.0422  92.46    93.56

Table 9.11: 10 most improved dependency relations with added information on voice (left) and finiteness (right), ranked by their weighted difference of balanced F-scores.

With the addition of information on voice we would expect an improvement for the SS and OO relations in particular, as well as the passive agent relation (AG). Table 9.11 shows the ranked list of improved dependency relations for the Voice experiment. We do find an improved assignment for subjects and objects, as well as the passive agent. The improvement for the AG is in fact notable, with F-scores improving from 73.6% to 80.0%.
However, since this dependency relation is infrequent in the treebank, improvement has less effect on overall results and this relation is ranked lower in the list in table 9.11 than the more frequent argument functions.

¹⁸⁰ We count as verbal predicates all elements annotated as verb (30767 tokens) or verbal participle (715 tokens), i.e., participles which are dependents of a verb. Moreover, we count as verbal passive predicates all verbs annotated as passives (s-suffixed verbs: 2413 instances), and all verbal participles annotated as passives (666 instances out of which 446 have vara ‘be’ as a head verb and 114 have bliva ‘become’). The group of verbal passives does not include passive participles in attributive function, e.g., tecknade symbolbilder ‘drawn pictures’, skalade potatisar ‘peeled potatoes’.

The improvement in analysis of the OO relation is clearly linked to verbal argument structure; a passive transitive verb does not take an object whereas its active version does. We find a 14.6% decrease in total number of errors for the OO_SS error type and we find that the corrected errors for this type (26.2% of the baseline errors) consist exclusively of objects of active verbs. We find a parallel improvement for the SS_OO error type, with a 10.9% decrease in total number of errors. The added information improves on 18.8% of the baseline errors, and we find that the corrected errors consist almost exclusively of postverbal subjects (93.1%) of passive verbs. Table 9.11 also shows improvement for the subject predicative relation (SP). This is due to the fact that the participle in periphrastic passive constructions is encoded as subject predicative.

The fact that the addition of information on voice does not have a great overall effect can be attributed to several factors. First of all and as mentioned above, only 9.8% of all verbal predicates are passive in Talbanken, so we are not adding a wealth of new information.
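The proportions cited for passives can be re-derived from the token counts given in the footnote on verbal predicates. The short check below uses only those counts; the variable names are illustrative.

```python
# Counts as reported in the footnote on verbal and passive predicates.
verbs = 30767
verbal_participles = 715
s_passives = 2413           # verbs with the passive suffix -s
participle_passives = 666   # verbal participles annotated as passive

verbal_predicates = verbs + verbal_participles
passive_predicates = s_passives + participle_passives

# Share of passives among all verbal predicates, in percent.
passive_share = 100 * passive_predicates / verbal_predicates
# Share of s-passives among all passive predicates, in percent.
s_passive_share = 100 * s_passives / passive_predicates

print(f"passive predicates: {passive_share:.1f}% of verbal predicates")
print(f"s-passives: {s_passive_share:.1f}% of passive predicates")
```

Rounded to one decimal, both values agree with the percentages given in the running text (9.8% and 78.4%).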
In addition, passives are not involved in that many errors made by the baseline parser. In fact, only 4% of the dependency relation assignment errors involve either a passive head or dependent. Hence it seems clear that passives do not pose severe problems for the baseline parser. Also, and touching on a more general problem with verbal features, complex predicates containing a finite auxiliary encode their external argument as a dependent of the auxiliary verb but other arguments as dependents of the lexical head verb. In other words, in the analysis of periphrastic constructions with a preverbal subject, the information on voice expressed on the non-finite verb may be separated from the subject by several constituents when the subject relation is assigned and is hence not available as a feature of the history. Even so, the above error analysis showed evidence for the effect of voice information for errors relating to argument disambiguation and acquisition of differential argument structures for active and passive verbs.

Tense

Judging from the previous section, information on tense is responsible for a majority of the improvement in dependency relation assignment observed with the addition of verbal features. An experiment (Tense) was therefore performed with information only on tense. The results in table 9.10 show a significant improvement from the baseline (p<.0001). We find that the property of tense is clearly responsible for the main improvement when we add information on verbal features (Verb).

Finiteness

In order to ascertain the influence of finiteness, an additional experiment was performed where the various tense features were mapped to their corresponding class of ‘finite’ or ‘non-finite’.¹⁸¹ We see the results in table 9.10 and find a significant improvement from the baseline (p<.0001). It is clear that the simple property of finiteness captures the relevant distinctions shown by the tense features.
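The mapping from tense categories to finiteness used in this experiment can be sketched as a simple lookup. The present tense, the past tense and the imperative are treated as finite and every other tense category as non-finite. The string labels below are illustrative names for the tense categories listed in section 9.1, not the treebank's actual tags, and placing the subjunctive forms in the non-finite class follows the literal wording ("the rest") rather than any independent claim.

```python
# Illustrative category names; the treebank's own tag strings are not shown here.
FINITE_TENSES = {"present", "past", "imperative"}


def finiteness(tense):
    """Collapse a tense category into the binary finite/non-finite feature."""
    return "finite" if tense in FINITE_TENSES else "non-finite"


tense_categories = [
    "present", "past", "imperative",
    "subjunctive", "infinitive", "supine",
]
mapping = {tense: finiteness(tense) for tense in tense_categories}
```

Collapsing six categories into two gives the parser fewer, more frequent feature values to learn from, which is one plausible reason a coarser feature can match or exceed the finer one.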
In fact, the mapping to a binary dimension of finiteness causes a further improvement of overall results (p<.03), compared to the use of tense features. This clearly supports the central role of finiteness in Scandinavian syntax, and V2-languages in general. Recall that the finite verb provides a fixed position in the positioning and ordering of clausal elements. As table 9.11 shows, the addition of finiteness information causes improved analysis for verbal relations, the core argument relations (SS, OO), as well as non-argument, adverbial relations (TA, AA, NA).

9.2.7 Feature combinations

The following experiments combine the different nominal argument features, the nominal argument features with the verbal features, and finally all available grammatical features in Talbanken05. A question for the following is therefore whether or not we will observe a combined effect that improves the best result obtained for individual features.

The combination of the argument features of animacy, definiteness, pronoun type and case (ADPC), as well as the addition of verbal features to this feature combination (ADPCV), cause a clear improvement compared to the baseline and each of the individual feature experiments (p<.0001), see table 9.2. Since the results are better than the individual runs, we may conclude that there is a cumulative effect of the combined information.

Table 9.12 shows a ranked list of the dependency relations with the greatest effect on the improved results for the ADPCV experiment. We find an improvement for the main argument relations of subjects and objects (SS, OO), the verbal relations (MS, VG), as well as for the other functions which improved the most with the individual argument features – determiners (DT), subject predicatives (SP) and formal subjects (FS).

Table 9.13 shows a ranked list of improved argument relations.
We find that the combined features result in improved performance for practically all argument relations.182 If we examine the confusion matrices for subjects and objects in table 9.4, we find a reduction of total errors for the SS_OO and OO_SS error types of 34.3% and 30.4%, respectively. With respect to the specific errors made by the baseline parser, we observe a substantial reduction of 44.6% for SS_OO and 46.0% for OO_SS.

181 Note that we are not equating tense and finiteness, since there are untensed forms which are still finite, e.g. the imperative (Holmberg and Platzack 1995). Rather, we map the present and past tenses, as well as the imperative, to the class ‘finite’ and the rest to the ‘non-finite’ class.

Relation  Freq    NoFeats  ADPCV     Relation  Freq    NoFeats  All
SS        0.1105  90.25    91.87     SS        0.1105  90.25    92.10
OO        0.0632  84.53    86.38     CC        0.0343  78.02    82.21
DT        0.1081  94.14    95.22     OO        0.0632  84.53    86.77
PA        0.1043  94.69    95.43     DT        0.1081  94.14    95.22
VG        0.0302  94.65    96.53     PA        0.1043  94.69    95.53
+F        0.0099  52.07    55.97     AA        0.0537  68.70    70.18
SP        0.0297  84.82    86.10     MS        0.0096  63.35    70.26
ET        0.0523  76.46    77.03     VG        0.0302  94.65    96.67
MS        0.0096  63.35    65.98     TA        0.0249  70.29    72.71
FS        0.0050  71.31    74.09     +F        0.0099  52.07    57.08

Table 9.12: 10 most improved dependency relations with combined features (ADPCV; left) and all features (right), ranked by their weighted difference of balanced F-scores.

Relation  Freq    NoFeats  ADPCV
SS        0.1105  90.25    91.87
OO        0.0632  84.53    86.38
SP        0.0297  84.82    86.10
FS        0.0050  71.31    74.09
AG        0.0019  73.56    79.75
FO        0.0009  56.68    67.65
ES        0.0050  71.82    73.67
VO        0.0007  72.10    84.72
VS        0.0006  58.75    65.56
OP        0.0011  27.91    30.28
IO        0.0024  76.14    77.09

Table 9.13: Improved argument relations with combined features (ADPCV), ranked by their weighted difference of balanced F-scores.
182 The only exception is the relation of logical object (EO), for which there is no change in accuracy compared to the baseline results. This is a very infrequent relation with only 22 instances in the treebank.

In the error analysis for the baseline parser in section 8.3, we concluded that word order and morphology do not provide sufficient information for argument disambiguation in all cases. In particular, we noted that arguments which depart from the most common ordering and/or are not morphologically marked are overrepresented among the errors. We concluded that additional linguistic information is needed in order to resolve these ambiguities. In tables 9.14 and 9.15 we examine word order and part-of-speech for the corrected SS_OO and OO_SS errors in the ADPCV experiment. We see that the added information contributes to the reduction of precisely the types of errors which were identified in the error analysis. In particular, improvement is centered in postverbal positions, largely occupied by nouns and case ambiguous pronouns.

               Before       After        Total
Gold  System   #    %       #    %       #    %
SS    OO       21   10.6    178  89.4    199  100.0
OO    SS       15   10.6    127  89.4    142  100.0

Table 9.14: Order relative to verb for corrected SS_OO and OO_SS errors in the ADPCV experiment.

               Noun         Pro_amb      Pro_unamb    Other        Total
Gold  System   #    %       #    %       #    %       #    %       #    %
SS    OO       144  72.4    23   11.6    18   9.0     14   7.0     199  100.0
OO    SS       111  78.2    21   14.8    6    4.2     4    2.8     142  100.0

Table 9.15: Part of speech for corrected SS_OO and OO_SS errors in the ADPCV experiment.

As we saw in section 5.1.1, Talbanken05 also contains a set of semantic features for other parts-of-speech, like adverbs, conjunctions and subjunctions. When we add the remaining set of linguistic features (All), the results improve further and differ significantly from the ADPCV experiment (p<.0001).
As table 9.12 shows, we observe a notable improvement for the conjunct relation (CC) as well as further improvement for argument relations (SS, OO), determiners, verbal relations and adverbials. The main difference from the ADPCV experiment lies largely in the improved analysis of different types of coordinated relations (CC, MS, +F) and adverbials (AA, TA). Improved analysis of coordinations and adverbials also influences the analysis of arguments. However, the confusion matrices for the various argument relations show that the added features have a different effect than the linguistic features studied above. For the error types expressing confusion of argument relations, such as SS_OO, OO_SS, SS_SP, FS_SS, we hardly observe any improvement at all. Rather, we find improvements in the error types involving arguments and coordinated elements, such as SS_CC, SP_CC, as well as argument and adverbial relations, such as OO_AA, ES_KA.

Figure 14: Total number of SS_OO errors (top) and OO_SS errors (bottom) in the various experiments.
[Figure not reproduced. Experiments shown in each panel: Baseline, Def, Case, Pro, Anim, Verb, ADPC, ADPCV, All.]

Figure 14 shows the total number of SS_OO and OO_SS errors in the various experiments and clearly illustrates the observed reduction for this error type with the chosen set of linguistic features, as well as the lack of effect for the remaining features added in the All experiment. This indicates that the initial error analysis and the hypotheses formulated there provided a useful understanding of the problem. The linguistic features presented in section 9.1, which were chosen to approximate dimensions of argument differentiation and defining properties of Scandinavian morphosyntax, contribute to the task of argument disambiguation.
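The ranking criterion used in tables 9.12 and 9.13 can be illustrated with a small sketch. It assumes that the weighted difference of a relation is its relative frequency multiplied by the difference in F-score between the two experiments. This definition is an assumption made for illustration, although it does reproduce the order of the ADPCV rows in table 9.12. The function names are hypothetical.

```python
# Rows copied from table 9.12 (ADPCV part):
# (relation, relative frequency, F-score NoFeats, F-score ADPCV)
ROWS = [
    ("SS", 0.1105, 90.25, 91.87),
    ("OO", 0.0632, 84.53, 86.38),
    ("DT", 0.1081, 94.14, 95.22),
    ("PA", 0.1043, 94.69, 95.43),
    ("VG", 0.0302, 94.65, 96.53),
    ("+F", 0.0099, 52.07, 55.97),
    ("SP", 0.0297, 84.82, 86.10),
    ("ET", 0.0523, 76.46, 77.03),
    ("MS", 0.0096, 63.35, 65.98),
    ("FS", 0.0050, 71.31, 74.09),
]


def weighted_difference(freq, f_base, f_new):
    """Assumed definition: F-score gain scaled by the relation's frequency."""
    return freq * (f_new - f_base)


def rank_relations(rows):
    """Return (relation, weighted difference) pairs, largest gain first."""
    scored = [(rel, weighted_difference(freq, base, new))
              for rel, freq, base, new in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for relation, score in rank_relations(ROWS):
        print(f"{relation:3s} {score:.4f}")
```

Under this assumption a frequent relation with a modest gain, such as DT, can outrank an infrequent relation with a large gain, such as FS, which is the behaviour visible in the tables.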
9.2.8 Selectional restrictions

We have seen how additional information on semantic properties of arguments contributes to improved dependency relation assignment. However, argument selection is clearly not only determined by isolated properties of the argument. On the contrary, many would claim that the semantic class of the verb and the selectional restrictions posed on the argument by the verb are key properties for understanding the relation between the arguments and the head predicate. As we noted in section 3.5, animacy has figured among the categories used to define selectional restrictions from the very beginning. Later, computational approaches have made reference to more fine-grained semantic classes, usually taken from the English WordNet.

In the following we will present some experiments investigating the addition of information on selectional restrictions for verbs, focusing on the category of animacy. We assign to verbs a selectional restriction class, see table 9.16, based on their occurrences in the treebank. The goal is thus to enrich the existing treebank annotation and furthermore to employ the extended features during parsing. We will in the following investigate the nature of selectional restrictions as absolute or gradient, as well as their effect on parse results.

Class    Restriction
AnimSS   Selects animate subject
AnimOO   Selects animate object
InanSS   Selects inanimate subject
InanOO   Selects inanimate object

Table 9.16: Selectional restriction classes.

9.2.8.1 Data extraction

We extract predicate-argument pairs from the treebank and generalize over these to determine the selectional restrictions for the predicate. In order to enable generalizations about verbs, the treebank is lemmatized prior to data extraction, using a lemmatizer for Swedish (Kokkinakis 2001).
The selectional restrictions of verbs constrain the semantic properties of arguments, operating at the syntax-semantics interface, and we must take into account mismatches in the mapping between syntactic structure and semantic predicate-argument relations. In extracting the relevant data, we have to determine for each verb-argument pair i) its semantic predicate, and ii) the grammatical relation between the argument and the predicate.

With regard to the first point, the treebank annotation of so-called “verb groups” must be taken into account. These consist of a finite auxiliary or modal verb along with one or more non-finite verbs. The subject argument of a verb group is annotated as a dependent of the finite verb, whereas complements are annotated as dependents of the non-finite lexical main verb. For instance, as the dependency representation in figure 15 for example (176) illustrates, the subject, man ‘one’, is the structural argument of the finite verb, the modal auxiliary kan ‘can’, whereas the direct object, struntsaker ‘trivialities’, is the structural argument of the lexical verb diskutera ‘discuss’:

(176) Där    kan  man  diskutera  struntsaker
      there  can  one  discuss    trivialities
      ‘There, one can discuss trivialities’

Figure 15: Dependency representation of example (176)
[Figure not reproduced. Dependency labels per token: Där RA, kan ROOT, man SS, diskutera VG, struntsaker OO.]

The dependency representation of (176) does not indicate that man ‘one’ semantically also acts as an argument of diskutera ‘discuss’. In order to locate the lexical main verb in the dependency graph, we follow each finite verb’s (possibly empty) chain of non-finite dependents. For the example sentence in (176), then, we extract the following pairs:

diskutera-SS:anim
diskutera-OO:inan

In addition to determining the semantic predicate, we must also determine the grammatical relation which holds between the predicate and its arguments.
In most cases, this corresponds directly to the dependency label of the argument, but in passive constructions the structural assignment of grammatical functions does not directly reflect the semantic relations of the predicate. For instance, in example (177) below, the verb lemma respektera ‘to respect’ should not be recorded as selecting an inanimate subject in this particular case, but rather an inanimate object.

(177) Parternas     integritet  måste  respekteras   . . .
      part-DEF.GEN  integrity   must   respect-PASS  . . .
      ‘The integrity of the parties must be respected’

So, in the case of passive verbs, the argument relations are inverted and we record the subject as an object of the verb:

respektera-OO:inan

9.2.8.2 Selectional association

We wish to assign to each verb in the treebank a selectional restriction class expressing the semantic restrictions it places on its arguments. Following Resnik (1996), we will base our notion of selectional restriction on the selectional association between a predicate and a semantic class, which allows us to quantify the extent to which a predicate selects for an animate, as opposed to inanimate, subject or object. The selectional association of a verb with respect to a class will determine its assigned selectional restriction class, see table 9.16. As we mentioned in section 3.5, theoretical and computational work on selectional restrictions has differed in its view on selectional restrictions as categorical or gradient. We will base our selectional restrictions on a probabilistic measure, approximated by corpus data, where verbs with categorical constraints on their arguments are simply the verbs with a selectional association of 1. We may thus experiment with different degrees of gradience in the selectional restriction classes by manipulating a threshold of selectional association.
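The extraction step of section 9.2.8.1, which supplies the counts for this measure, can be sketched as follows. This is a minimal illustration and not the procedure actually run on the treebank: the `Token` class, the function names and the toy encodings of examples (176) and (177) are hypothetical, and the heads given to tokens other than the subject and object are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    idx: int                       # position in the sentence, starting at 1
    lemma: str
    head: int                      # index of the head token, 0 for the artificial root
    deprel: str                    # dependency label, e.g. SS, OO, VG
    animacy: Optional[str] = None  # "anim" or "inan" for annotated nominals
    passive: bool = False          # True for passive verb forms


def lexical_verb(verb: Token, tokens: List[Token]) -> Token:
    """Follow the (possibly empty) chain of VG dependents down to the main verb."""
    current = verb
    while True:
        below = [t for t in tokens if t.head == current.idx and t.deprel == "VG"]
        if not below:
            return current
        current = below[0]


def extract_pairs(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (verb lemma, relation, animacy) triples for subjects and objects."""
    by_idx = {t.idx: t for t in tokens}
    pairs = []
    for tok in tokens:
        if tok.deprel not in ("SS", "OO") or tok.animacy is None:
            continue
        if tok.head not in by_idx:
            continue
        predicate = lexical_verb(by_idx[tok.head], tokens)
        relation = tok.deprel
        if relation == "SS" and predicate.passive:
            relation = "OO"        # the subject of a passive is recorded as an object
        pairs.append((predicate.lemma, relation, tok.animacy))
    return pairs


# Toy encodings of examples (176) and (177); heads of non-arguments are assumed.
EX_176 = [
    Token(1, "där", 2, "RA"),
    Token(2, "kan", 0, "ROOT"),
    Token(3, "man", 2, "SS", animacy="anim"),
    Token(4, "diskutera", 2, "VG"),
    Token(5, "struntsaker", 4, "OO", animacy="inan"),
]
EX_177 = [
    Token(1, "parternas", 2, "DT"),
    Token(2, "integritet", 3, "SS", animacy="inan"),
    Token(3, "måste", 0, "ROOT"),
    Token(4, "respektera", 3, "VG", passive=True),
]

if __name__ == "__main__":
    print(extract_pairs(EX_176))
    print(extract_pairs(EX_177))
```

For (176) the sketch yields the subject and object pairs for diskutera, and for (177) the single inverted pair for respektera, matching the pairs listed in section 9.2.8.1.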
Resnik (1996) presents a method for acquisition of selectional restrictions which is based on an information-theoretic approach to verbal argument selection. The approach quantifies the overall extent to which a predicate constrains the semantic class of its arguments as its selectional preference strength.183 The contribution of the semantic class of the argument to the selectional preference strength of the predicate is expressed as the selectional association between a predicate and the particular class. Resnik (1996) looks at selectional restrictions for verbs and their objects and employs the WordNet resource (Fellbaum 1998) for obtaining semantic classes. Since nouns may belong to several classes in WordNet and no sense-tagged corpus is available, the estimation of frequencies distributes the ambiguity evenly over the nouns of a class.

183 Selectional preference strength expresses the amount of information a predicate carries regarding its argument by looking at the difference in the prior distribution of a semantic class ($P(c)$) in an argument position and the resulting distribution when taking the specific predicate into account, expressed by the conditional probability $P(c \mid p_i)$. More precisely, it is defined as the relative entropy (Kullback-Leibler divergence) between the prior distribution of semantic classes of arguments and the distribution for a particular predicate (verb) (Resnik 1996):

\[ S(p_i) = D\bigl(P(c \mid p_i) \,\|\, P(c)\bigr) = \sum_{c} P(c \mid p_i) \log \frac{P(c \mid p_i)}{P(c)} \]

Rather than defining selectional association through the additional notion of selectional preference strength, we define the association of a verb v with an argument class c directly as the conditional probability of the class given the predicate (Manning and Schütze 1999: 293):

\[ A(v,c) = P(c \mid v) \]

Resnik (1996) proposes an estimation of the joint probability P(v,c), which takes into account class ambiguity for individual nouns.
However, since we have annotated data with respect to the person/non-person distinction in Talbanken05, the estimation reduces to the following:

\[ P(v,c) = \frac{\mathrm{freq}(v,c)}{N} \]

where N is the total number of occurrences with respect to an argument type – subject or object – and freq(v,c) is the count of occurrences of a verb with an argument of a certain class. P(v) is estimated as the maximum likelihood estimate $\mathrm{freq}(v) / \sum_i \mathrm{freq}(v_i)$, expressing the relative frequency of the verb with respect to all verbs.

For each verb lemma in Talbanken, selectional association with the semantic classes of animate and inanimate is calculated based on the extracted data from the treebank, as detailed above. Verbs are then assigned a selectional restriction class based on their association with the two classes. In doing so, we set a threshold expressing the level of gradience embodied by the selectional restriction. A threshold of 1.0 will assign a class only to verbs which impose categorical constraints on their arguments, whereas a lower threshold, e.g. 0.9, will allow for some variation. In the following we will experiment with both categorical and gradient selectional restriction classes.

Selectional association is calculated separately for subjects and objects. This means that the sets of verbs in the classes for subjects and objects are not disjoint. A transitive verb lemma may be assigned both a subject class (AnimSS/InanSS) and an object class (AnimOO/InanOO). There are thus complex classes for transitive verbs which cover subsets of the simple classes in table 9.16. However, since the parser allows for bundles of individual features, there is a clear advantage in tagging the data for simple classes.184 This choice of annotation will facilitate generalization over the selectional restrictions of verbs of differing valencies. It will also ensure that we do not mix the notion of subcategorization with that of selectional restriction.
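The estimation and class assignment described above can be sketched in a few lines. The sketch assumes that freq(v,c) and freq(v) are both counted over the same argument position, so that N cancels and A(v,c) reduces to freq(v,c)/freq(v). The function names are hypothetical, and the counts in `TOY_PAIRS` are invented to mirror the behaviour reported for visa ‘show’; they are not treebank counts.

```python
from collections import Counter

# Mapping from (relation, animacy class) to the class names of table 9.16.
CLASS_NAMES = {
    ("SS", "anim"): "AnimSS", ("SS", "inan"): "InanSS",
    ("OO", "anim"): "AnimOO", ("OO", "inan"): "InanOO",
}


def selectional_association(pairs):
    """A(v,c) = P(c|v), estimated separately for each argument relation."""
    joint = Counter()      # freq(v, c) per relation
    per_verb = Counter()   # freq(v) per relation
    for verb, relation, cls in pairs:
        joint[(verb, relation, cls)] += 1
        per_verb[(verb, relation)] += 1
    return {key: n / per_verb[key[:2]] for key, n in joint.items()}


def restriction_classes(pairs, threshold=1.0):
    """Assign every class whose association reaches the threshold."""
    classes = {}
    for (verb, relation, cls), assoc in selectional_association(pairs).items():
        if assoc >= threshold:
            classes.setdefault(verb, set()).add(CLASS_NAMES[(relation, cls)])
    return classes


# Invented toy counts (assumption): 'visa' varies, 'klaga' is categorical.
TOY_PAIRS = (
    [("visa", "SS", "inan")] * 7 + [("visa", "SS", "anim")] * 3
    + [("visa", "OO", "inan")] * 19 + [("visa", "OO", "anim")] * 1
    + [("klaga", "SS", "anim")] * 5
)

if __name__ == "__main__":
    for t in (1.0, 0.95, 0.70):
        print(t, restriction_classes(TOY_PAIRS, threshold=t))
```

With these toy counts, visa receives no class at a threshold of 1.0, an object class at 0.95 and additionally a subject class at 0.70, while klaga is classified at every threshold. Keeping subject and object classes as separate simple labels corresponds to the annotation choice motivated above.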
184 As we recall from section 8.1.3, the parse guide classification is performed employing support vector machines. Depending on the kernel function chosen, feature combinations of size n are constructed internally. Here we employ a quadratic kernel function, hence n=2 and pairs of all features are constructed for classification. So, for instance, pairs such as AnimSS&HH will express a selectional restriction class of animate subjects occurring with a nominal with person reference (HH).

Class   Types  Tokens  Examples
AnimSS  388    1549    kasta ‘throw’, katalogisera ‘catalogue’, kissa ‘pee’, klaga ‘complain’, klamra ‘cling’, klandra ‘blame’, klappa ‘pat’
InanSS  333    1824    spegla ‘mirror’, spetsa ‘sharpen’, spoliera ‘spoil’, spricka ‘shatter’, sprida ‘spread’, spruta ‘squirt’, sticka ‘sting’
AnimOO  143    618     adoptera ‘adopt’, akta ‘beware’, aktivera ‘activate’, avskräcka ‘scare’, be ‘ask’, befatta ‘involve’, befria ‘free’
InanOO  739    5178    rengöra ‘clean’, reparera ‘repair’, representera ‘represent’, restaurera ‘restore’, revidera ‘revise’, rikta ‘direct’, riva ‘tear down’

Table 9.17: Verb lemmas by selectional restriction class with A(v,c) = 1.0.

9.2.8.3 Gradience of selectional restrictions

Table 9.17 presents an overview of the verb lemmas and tokens assigned to each of the four classes under a categorical definition of selectional restrictions, i.e., where A(v,c) = 1.0. As we see from the examples in table 9.17, the selectional restriction classes for subjects cut across valency classes and include both intransitive and transitive verbs. Altogether, 1181 unique verb lemmas were assigned at least one selectional restriction class. This gives us a coverage of 75.0% of the total number of unique verb lemmas in Talbanken (1573 lemmas in total). As expected, among the verb lemmas not covered by the classification we find verbs which may function as auxiliary verbs (copula, modals etc.)
and hence may take any type of argument depending on the non-finite lexical verb. It is therefore not surprising that the classification coverage in terms of verb tokens is low; the auxiliary verbs are, after all, the overall most frequent verbs. As a consequence, only 24.5% (7523 tokens out of a total of 30767) of the verb tokens in Talbanken receive a selectional restriction class.

Another factor which influences the coverage is clearly the fact that the restrictions are categorical and do not allow for any variation. Many verbs are simply not that restrictive with respect to their arguments. For instance, the verb visa ‘show’ may occur with both animate and inanimate subjects, as in (178) and (179) below, as well as animate and inanimate objects, as in visa någon ‘show someone’ and (178). As a consequence, this verb is not assigned a selectional restriction class with a threshold of 1.0.

(178) Han  skall  visa  intyg
      he   shall  show  proof
      ‘He will show proof’

(179) Konsumentprisindex    skall  visa  prisförändringar
      consumer-price-index  shall  show  price-changes
      ‘The consumer price index should indicate price changes’

In addition to variation with respect to argument selection, there are also elements of the annotation in Talbanken05 which contribute to the lack of coverage. First of all, the animacy annotation of collective nouns and organizations contributes to the low coverage, since these are annotated as inanimate. For instance, in example (180) below, the subject of the verb skriva ‘write’ is the noun länderna ‘countries’, which is annotated as inanimate but clearly employed metonymically to refer to the animate representatives from the countries.

(180) . . . som    de   sex  länderna       skrev  under  den  25  mars   1957  i   Rom
      . . . which  the  six  countries-DEF  wrote  under  the  25  march  1957  in  Rome
      ‘. . . which the six countries signed on the 25th of March 1957 in Rome’

In fact, all other active instances of this verb occur with a human subject.185 Due to the example in (180), then, the verb lemma is not assigned the AnimSS class. Since it is a transitive verb it should in principle be assigned an object class as well, preferably the InanOO class. After all, one usually writes something, not someone. However, it is not assigned an object class since it occurs with a direct object (‘mamma’ ‘mom’) annotated as person referring:

(181) När   Bowlby  skriver  ‘mamma’  . . .
      when  Bowlby  write    ‘mom’    . . .
      ‘When Bowlby writes ‘mom’. . . ’

This is a direct consequence of an annotation practice which is largely denotational rather than referential, see section 7.1.2.

185 The verb in example (180) occurs with the particle under ‘under’ and one might argue that it should be represented as a separate verb lemma altogether. A special treatment of particle verbs was not, however, pursued further in the present context.

Threshold  Types  Tokens
1.00       1181   7523
0.95       1205   9482
0.90       1245   14866
0.85       1278   17195
0.80       1312   18695
0.75       1343   26948
0.70       1354   27951
0.65       1388   29674
0.60       1393   30271
0.55       1397   30314

Table 9.18: Absolute number of classified verb types and corresponding treebank tokens under various selectional association thresholds t, where A(v,c) ≥ t and c ∈ {AnimSS, InanSS, AnimOO, InanOO}.

Table 9.18 illustrates the number of types and tokens that receive at least one selectional restriction class under various thresholds for selectional association. Clearly, lowering the threshold allows for more variation in argument selection and hence provides wider coverage. With a threshold of 0.95 we find, for instance, that the verb skriva ‘write’, which did not receive a class earlier, now receives the class AnimSS. With the threshold lowered to 0.90 it also receives the class InanOO.
Above, we also examined the verb visa ‘show’, which exhibited quite a bit of variation in terms of selectional restrictions. With a threshold of 0.95 it receives the class InanOO, encoding a preference for inanimate objects. The threshold has to be lowered to 0.70, however, for the verb to receive a subject class (InanSS), indicating the lower degree of selectional constraint which this predicate enforces on its subject argument, as exemplified by (179) above.

9.2.8.4 Experiments with selectional restrictions

In a set of parse experiments, the information on selectional restrictions for verbs (SR), extracted as detailed above, is included as an additional feature. All experiments except one (SR1.0) also include information on animacy, as this is the semantic class relevant for the selectional restrictions. In evaluating the results we will therefore compare the results both to the general baseline (NoFeats) and to the experiment employing only animacy information (Anim). In the experiments we furthermore vary the selectional association threshold for the verb classes expressing selectional restrictions.

             Unlabeled  Labeled
NoFeats      89.87      84.92
Anim         89.93      85.10
SR1.0        89.80      84.91
SR1.0&Anim   89.89      85.06
SR0.95&Anim  89.90      85.05
SR0.90&Anim  89.92      85.04
SR0.85&Anim  89.93      85.04

Table 9.19: Overall results for experiments with selectional restrictions, expressed as unlabeled and labeled attachment scores.

The third row of table 9.19 shows the overall parse results with information on categorical selectional restriction class for verbs only (SR1.0), i.e. without the corresponding semantic information for arguments. This experiment was performed in order to observe the effect of the selectional restriction classes in isolation. As we can see, the addition has a slightly detrimental, but not statistically significant, effect on overall labeled results and displays a significant dip in unlabeled accuracy (p<0.01).
The fourth row of table 9.19 shows the results in the parse experiments employing categorical selectional restrictions along with information on animacy (SR1.0&Anim), and we do not find any significant improvements compared to the Anim experiment. It is clear that the problems caused by the addition of selectional restriction information in isolation are not countered by any effects obtained in combination with the semantic information for animacy.

Since the addition of information only on selectional restrictions (VerbSR) clearly has some unexpected side-effects which carry over to the other experiments, we perform an error analysis of these results. We focus largely on the unlabeled results, which is where we observed a clear deterioration of results. Recall from section 8.2.1 that the set of unlabeled errors is defined solely by errors in attachment. We sort these errors into error types based on the part-of-speech of the dependent, as well as the correct and erroneously assigned head. We find that the additional information causes a rise in the number of unlabeled attachment errors for all major types of parts-of-speech except for verbal dependents. Table 9.20 shows the total number of attachment errors for the error types which increase the most with the added information.

For nominal elements like nouns (N) and pronouns (PO), we note an increase in attachment to the superficial root of the dependency graph (ROOT). As we noted in section 8.3.2, erroneous attachment to the artificial root means that an element has not been attached at all. An increase in root attachment
An increase in root attachment 236 Parsing with linguistic features POSDep POSGold POSSys NoFeats VerbSR PR V N 828 860 -3.9 PO V ROOT 189 212 -12.2 AB V V 324 336 -3.7 N PR ROOT 75 87 -16.0 N N PR 218 229 -5.0 N V ROOT 288 297 -3.1 N N ROOT 199 207 -4.0 AB V ROOT 104 112 -7.7 ++ V N 119 127 -6.7 AJ N V 99 106 -7.1 Table 9.20: Total number of errors in the SR1.0 experiment compared to the NoFeats baseline, along with relative deterioration compared to the baseline (%); sorted by error type (POSdep_POShead_POSerr) and ranked by total dif- ference of deterioration. with the added information on verbal classes thus indicates more restrictive attachments, resulting in a more fragmented analysis. The newly added attach- ment errors for prepositions (PR) and adverbs (AB) are characterized by being adverbial in nature; instead of attachment to the correct verb, an alternative, erroneous site is chosen (noun, another verb etc.). What we observe then, both for the argumental and adverbial elements, is a general resistance to attach to verbal elements, clearly caused by the added information. The assumption that selectional restrictions are categorical has clearly been shown to be too strong. As we saw above, it results in missed generalizations as well as a poor coverage for the enriched annotation. Parse experiments were therefore performed testing three different thresholds (t ∈ {0.95,0.90,0.85}) for selectional association between a verb and an animacy class, expressing an increased gradience within the selectional restriction classes. In the last three rows of table 9.19, we show the results from these three experiments. With a threshold of 0.95 we now have a coverage of 55.9% of the total verb tokens in Talbanken. But even with an increased coverage, we find that the results do not improve significantly compared to the Anim-experiment. 
9.2.8.5 Summary

The above sections have detailed a strategy for extracting selectional restrictions for verbs from Talbanken, based around the semantic dimension of animacy, for which we have annotated data. The ensuing experiments have tested the effect of selectional restrictions for verbs during parsing.

It has proven difficult to obtain any strong effects from the selectional restrictions. There are several interdependent factors which contribute to the situation. First of all, it is well known that argument realization is characterized by massive variation. We have looked at variation in semantic argument characteristics – verbs select arguments of different classes and do so to differing degrees. We employed a probabilistic measure of association strength between a verb and a semantic argument class where variation is interpreted as degrees of gradience, and experimented with both categorical and gradient selectional restrictions. One complicating factor is that functional verbs such as auxiliaries, which are highly frequent, do not constrain their arguments and consequently are not assigned a selectional restriction class.

It seems fair to assume that we are dealing with a sparse data situation at more than one level. We have a coverage problem in the treebank data due to variation in the realization of the nominal semantic classes of verbal arguments. However, even with complete coverage of verbal lemmas in the treebank data, it might still be problematic to assume that the treebank contains sufficient data for assignment of selectional restrictions. This assumption might simply be wrong, and it is most likely the case that substantially more data is needed in order to make the right kinds of predictions. We will return to this issue in section 9.4.2 below, where we will employ a considerably larger, automatically annotated data set.
Moreover, the assumption that selectional restrictions may be reduced to a binary notion of animacy may also be debated. It might be that this is too coarse a distinction to enable interesting generalizations regarding verbal semantics.

9.3 Features of the parser

The above experiments have focused on variations in the linguistic input to parsing, whereas properties of the parser have been kept constant. This section will investigate variations over different features of the parser on the input data employed above. In particular, we will examine parser generalizability and feature locality. These will in different ways further elucidate the nature of the effects of our set of linguistic features.

9.3.1 Parser comparison

In the experiments in section 9.2, we have exclusively employed the MaltParser system for data-driven dependency parsing. The aim of this section is to compare the effect of the linguistic features investigated above when employing a different parser. In particular, we will investigate the stability of the effects across parsers and to what extent they are dependent on certain properties of the parser.

As we mentioned in section 8.1.3, we may distinguish roughly between two current approaches to data-driven dependency parsing: the graph-based approaches and the transition-based approaches (McDonald and Nivre 2007). The main characteristics of the two approaches may be summarized as follows:

Graph-based: locate the highest-scoring dependency graph given an induced scoring function
  • global training
  • exhaustive search/inference

Transition-based: locate the optimal transition sequence given an induced parse guide
  • local training
  • greedy search/inference

The graph-based approaches typically employ global training and induce a scoring function for dependency graphs. Parsing is thus construed as search through all possible dependency graphs for a sentence to locate the highest-scoring graph.
Transition-based approaches, in contrast, employ local training in the induction of a parse guide which, in combination with a greedy search algorithm, optimizes the parse transitions. As a practical consequence of the differences in induction and search, the feature models employed also differ. Graph-based approaches typically employ a rather limited feature model in the representation of dependency graphs, whereas transition-based approaches operate with a richer feature history in order to compensate for the decomposition into transitions. These two approaches thus differ with respect to a number of properties, but nevertheless achieve comparable overall results in dependency parsing (Buchholz and Marsi 2006). It might therefore be interesting to compare the effect of the linguistic features studied above.

9.3.1.1 MSTParser

MSTParser (McDonald, Crammer and Pereira 2005; McDonald et al. 2005) is an instance of a graph-based data-driven dependency parser.186 As we mentioned above, parsing with MSTParser consists in locating the highest scoring dependency graph according to a scoring function. The scoring function is induced through global training with features of the head, dependent, as well as elements occurring in the vicinity of these (before/after/between). Global training is training based on the global dependency graph and features are therefore not limited to previous parse decisions. However, the vast space of possibilities, in theory all possible subgraphs, in practice limits the expressiveness of the feature model.

186 MSTParser is freely available from http://mstparser.sourceforge.net

          MaltParser           MSTParser
          Unlabeled  Labeled   Unlabeled  Labeled
NoFeats   89.87      84.92     89.67      82.91
Anim      89.93      85.10     89.82      83.04
ADPC      90.17      85.45     90.00      83.22
ADPCV     90.42      85.73     90.21      83.41

Table 9.21: Overall results for experiments comparing MaltParser and MSTParser, expressed as unlabeled and labeled attachment scores.
The feature model in MSTParser is hard-coded, hence may not be modified as easily. Our additional linguistic features (FEATS) are employed more restrictively in the scoring of edges than the part-of-speech and lexical features and represent only the head, dependent and conjunctions of these. Furthermore, dependency labels are not used as features during parsing.

9.3.1.2 Comparative experiments

In a set of experiments we run MSTParser on the NoFeats, Anim, ADPC and ADPCV data sets from Talbanken05 employed in our earlier experiments. Apart from the choice of parser, the experimental setting and evaluation measures are identical to the earlier experiments and MSTParser is run with default settings.

Table 9.21 shows the overall results in the MSTParser experiments and contrasts them with the corresponding MaltParser results. We compare the results for both parsers individually with a baseline employing no additional linguistic features (NoFeats) and experiments testing the addition of information regarding animacy (Anim), animacy, definiteness, pronoun type, and case (ADPC), as well as verbal features (ADPCV). These are the equivalents of the experiments detailed in sections 9.2.2 and 9.2.7 above.

First of all, we may note that the general results, with or without added features, are lower than the corresponding results obtained with MaltParser. The most notable discrepancy is in the labeled results. Table 9.22 shows an overview of the ten most common labeled error types for the MSTParser baseline.

Gold   System   #
SS     OO       718
++     ++       674
AA     OA       557
ET     OA       489
AA     AA       481
ET     ET       469
OA     ET       465
OO     SS       461
AA     RA       451
UK     UK       435

Table 9.22: 10 overall most frequent error types for MSTParser on the NoFeats data set, where SS=subject, OO=object, ++=conjunction, AA=other adverbial, OA=object adverbial, ET=nominal post-modifier, RA=spatial adverbial, UK=subjunction.

It exhibits some differences from the most common error types for the MaltParser baseline, as presented in table 8.1 in section 8.2.3. In general, we find that attachment errors, such as ++_++, AA_AA, ET_ET, UK_UK, are more common among the MSTParser errors. These are errors where the dependent is labeled correctly, but where the attachment is incorrect. In section 8.2.3 we noted that confusion of subjects and objects, as well as various adverbial relations, constituted the most common errors in the MaltParser baseline results. We find the same error types in the results for MSTParser as well, and the SS_OO error type is in fact the most common error type made by the baseline parser. The SS_OO and OO_SS errors show very similar properties to the errors analyzed in section 8.3.2. There, we found that the distribution of errors differed from the overall distribution of subjects and objects with respect to word order and morphological marking. In other words, the subjects and objects which deviated from the norm were overrepresented among the errors. We find the same, clear pattern in the baseline results for MSTParser. 83.1% of the SS_OO errors are postverbal and 94.3% are realized by a noun or case-ambiguous pronoun. For the OO_SS errors, we find that 35.4% are preverbal and 96.7% are nouns or case-ambiguous pronouns.

As table 9.21 shows, the added features have a positive effect on overall results also when employing MSTParser and we find that all differences are significant compared to the NoFeats baseline. The addition of information on animacy causes a clear improvement in unlabeled results (p<0.0004) and a smaller improvement (p<0.003) in labeled results. This is the converse situation from the results for MaltParser, where the observed improvement was largely in terms of labeled results. If we examine the sets of attachment errors, we find the most notable improvement for cases of pronouns, nouns and verbs. These are mainly errors in attachment to verbs (argument attachment), as well as attachment of nominal elements to other nominal elements (phrasal attachment).

        NoFeatsMST   AnimMST
DT      94.14        94.49
SS      87.32        87.60
ET      71.38        71.81
++      90.50        90.75
AN      30.88        32.43
ROOT    94.60        94.72
CC      76.95        77.15
TA      62.39        62.62
PA      93.77        93.82
VG      92.86        93.01

Table 9.23: 10 most improved dependency relations for the MSTParser with added information on animacy, ranked by their weighted difference of balanced F-scores.

Table 9.23 shows a ranked list of the dependency relations which show the largest improvement in the AnimMST experiment. Ranked at the top of the list are the determiner (DT) and subject relations (SS). A closer look at the results shows that the improvement in labeled results is largely due to the improved attachment. As a consequence, we observe a notable reduction in total number of errors for error types involving ambiguities between a phrasal and clausal reading, such as SS_DT, DT_SS. We may note that the performance for the DT relation is in fact identical for MaltParser and MSTParser, with a baseline F-score of 94.14, and the animacy information causes a nearly identical improvement for this relation, to 94.48 and 94.49, respectively. A clear difference between the two systems, however, is found in improvement in terms of labeling only. In contrast to the MaltParser results, performance does not improve for the argument relations of OO and SP.
The total number of SS_OO and OO_SS errors, representing largely a labeling error, also does not decrease notably.187

187 The results show a small improvement of 2.6% for the OO_SS error type; however, these are all due to corrected attachment errors of the SS_DT type in the immediately preceding context, see section 8.3.2 and examples (163)–(164).

        Freq     NoFeatsMST   ADPCVMST
DT      0.1081   94.14        95.22
SS      0.1105   87.32        88.10
PA      0.1043   93.77        94.21
ET      0.0523   71.38        72.13
OO      0.0632   79.92        80.39
++      0.0422   90.50        91.13
ROOT    0.0649   94.60        94.99
CC      0.0343   76.95        77.61
AA      0.0537   63.80        64.18
+F      0.0099   45.38        46.70

Table 9.24: 10 most improved dependency relations for MSTParser with added information on ADPCV, ranked by their weighted difference of balanced F-scores.

In the ADPC and ADPCV experiments we find a significant improvement of overall results (p<0.0001) both in terms of labeled, as well as unlabeled results. Table 9.24 shows the performance per dependency relation for the ADPCV experiment and we find an improved analysis for all argument relations.

We may conclude that the added information has a general positive effect also with a parser which is radically different from MaltParser. We found that the largest effect is in terms of unlabeled results, hence an increased attachment accuracy both for clausal and phrasal constituents. Even so, we generally observe a less notable improvement in terms of labeled results compared to the results in the experiments with MaltParser. McDonald and Nivre (2007) show that MaltParser cross-linguistically has a better performance for core argument relations like subjects and objects than MSTParser and suggest that a possible reason for this is the fact that MSTParser does not condition on the previously assigned dependency relations during parsing. The results from our experiments corroborate this and indicate that the improvement in terms of argument analysis is partially dependent on properties of the preceding analysis during parsing. We have also noted that the additional linguistic features are employed highly locally in MSTParser. In the following section, we will investigate the influence of feature locality in argument disambiguation further.

9.3.2 Feature locality

In the experiments in section 9.2 above, we have employed the same feature model whilst varying the input of the linguistic features. The experiments with MSTParser discussed above suggested that conditioning on properties of the preceding linguistic context is important in argument disambiguation. Varying the feature model of the parser provides a manner of testing the influence of our features further.

The feature model employed by the parse guide in MaltParser provides a rich history for each transition. The feature model in figure 13 shows that the additional linguistic features, represented by the attribute FEATS, are employed for highly local tokens in a candidate head-dependent relation, as well as tokens which are further removed in the dependency graph, such as siblings and grandparents.

            Unlabeled   Labeled
NoFeats     89.87       84.92
Anim        89.93       85.10
ADPC        90.17       85.45
AnimLocal   89.93       85.08
ADPCLocal   90.16       85.39

Table 9.25: Overall results for experiments with feature locality in MaltParser.

We perform a set of experiments where additional linguistic features are limited to the token on top of the stack and the next input token, i.e., top and next. The FEATS-features are thus limited to a highly local context, whereas features for the remaining attributes, FORM, POS, DEP are kept constant.
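The restriction to a local feature model can be sketched as a simple filter over a list of feature specifications. The representation below is hypothetical: it does not reproduce the feature model of figure 13 or MaltParser's own specification format, and the address names other than top and next are invented for illustration.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    attribute: str  # FORM, POS, DEP or FEATS
    address: str    # which token of the parser configuration is inspected


# Hypothetical "full" feature model: FEATS is defined for local tokens
# (top, next) and for tokens further removed in the dependency graph.
FULL_MODEL = [
    Feature("FORM", "top"),
    Feature("FORM", "next"),
    Feature("POS", "top"),
    Feature("POS", "next"),
    Feature("POS", "lookahead1"),
    Feature("DEP", "leftmost_dep(top)"),
    Feature("FEATS", "top"),
    Feature("FEATS", "next"),
    Feature("FEATS", "head(top)"),
    Feature("FEATS", "left_sibling(next)"),
]

LOCAL_ADDRESSES = {"top", "next"}


def localize(model):
    """Keep FORM, POS and DEP features unchanged; keep FEATS only for the
    token on top of the stack and the next input token."""
    return [
        f for f in model
        if f.attribute != "FEATS" or f.address in LOCAL_ADDRESSES
    ]


LOCAL_MODEL = localize(FULL_MODEL)
```

Only the FEATS entries are filtered, so any difference in parse results between the two models can be attributed to the locality of the linguistic features rather than to the remaining attributes.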
Table 9.25 shows the overall results for two experiments employing this local feature model with various argument features: AnimLocal, where animacy information is included, and ADPCLocal, which employs information on animacy, definiteness, pronominal type and case. We chose to focus on argument features, and not include verbal features, in order to enable isolation of the effects on arguments.

The results indicate that the observed effect of the argument features is largely local. Both experiments (AnimLocal, ADPCLocal) show slightly, but not significantly, lower overall results compared to the counterparts employing a full feature model (Anim, ADPC). This means that the added information regarding candidate head and dependent is responsible for a majority of the improvement observed with the added features.

      Freq     NoFeats   AnimLocal          Freq     NoFeats   ADPCLocal
DT    0.1081   94.14     94.55        DT    0.1081   94.14     95.14
SS    0.1105   90.25     90.65        SS    0.1105   90.25     91.12
OO    0.0632   84.53     84.83        PA    0.1043   94.69     95.31
AA    0.0537   68.70     69.05        OO    0.0632   84.53     85.47
PA    0.1043   94.69     94.78        ET    0.0523   76.46     77.09
TA    0.0249   70.29     70.64        AA    0.0537   68.70     69.21
OA    0.0305   70.63     70.90        SP    0.0297   84.82     85.54
AT    0.0441   95.76     95.92        FS    0.0050   71.31     74.08
UK    0.0305   93.17     93.39        OA    0.0305   70.63     71.00
SP    0.0297   84.82     85.02        IO    0.0024   76.14     79.68

Table 9.26: 10 most improved dependency relations with the local feature model and animacy features (left) and animacy, definiteness, pronoun type and case features (right), ranked by their weighted difference of balanced F-scores.

In table 9.26 we see a ranked list of the most improved dependency relations, compared to the NoFeats baseline, in the experiments with a local feature model. Compared to the full feature model counterparts we observe somewhat lower results for the argument relations.
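The ranking criterion used in table 9.26 can be made concrete with a small sketch. The exact definition of the weighted difference is not restated in this section, so the code assumes that it is the relative frequency of the relation (the Freq column) multiplied by the difference in F-score; under this assumption the ordering of the left-hand part of table 9.26 is reproduced. The function names are hypothetical.

```python
# Rows of table 9.26 (left): relation -> (relative frequency, NoFeats F, AnimLocal F).
# The rows are listed alphabetically here so that the ranking is actually computed.
TABLE_9_26_LEFT = {
    "AA": (0.0537, 68.70, 69.05),
    "AT": (0.0441, 95.76, 95.92),
    "DT": (0.1081, 94.14, 94.55),
    "OA": (0.0305, 70.63, 70.90),
    "OO": (0.0632, 84.53, 84.83),
    "PA": (0.1043, 94.69, 94.78),
    "SP": (0.0297, 84.82, 85.02),
    "SS": (0.1105, 90.25, 90.65),
    "TA": (0.0249, 70.29, 70.64),
    "UK": (0.0305, 93.17, 93.39),
}


def weighted_difference(freq, f_base, f_new):
    """Assumed definition: frequency-weighted gain in balanced F-score."""
    return freq * (f_new - f_base)


def rank_relations(rows):
    """Sort relations by weighted difference, largest improvement first."""
    return sorted(
        rows,
        key=lambda rel: weighted_difference(*rows[rel]),
        reverse=True,
    )


RANKED = rank_relations(TABLE_9_26_LEFT)
```

The weighting explains why a frequent relation such as DT can outrank a rarer relation with a larger raw gain in F-score.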
In parallel with the observations made in the MSTParser experiments, we find that the DT relation is the relation for which we find the largest improvement, clearly indicating a local effect of the features. The determiner-head relation constitutes a phrasal context where two nominal elements must be disambiguated. On several occasions, we have noted the animacy effect in genitive constructions. Determiners are typically not animate, unless marked with genitive case, and these are inherent properties of the nominal in question. Differentiating features of the two nominals clearly benefit disambiguation, and we observe improved attachment for determiners, as well as labeled improvement for the DT_SS and SS_DT error types. Unlike the results obtained with MSTParser, however, we also observe improved performance for a range of argument relations in the AnimLocal experiment, such as objects (OO) and subject predicatives (SP). The fact that MaltParser conditions on preceding dependency relations is a factor which still distinguishes the two parsers. For the OO and SP relations, which are predominantly found in postverbal position, knowledge regarding assignment of preverbal dependents is clearly important for correct analysis.

The overall results presented above conceal some interesting differences between the local and full feature model parsers which a further error analysis reveals. Table 9.27 shows relevant excerpts from the confusion matrices for the SS, OO, SP, FS and IO argument relations. It provides the total number of errors for error types expressing confusion of argument relations in the local experiments, compared with the full feature model baseline and the respective counterparts.

                NoFeats        AnimLocal     ADPCLocal     Anim          ADPC
Gold   System   #     % tot.   #     %       #     %       #     %       #     %
SS     OO       446   25.9     419   6.1     393   11.9    388   13.0    361   19.1
OO     SS       309   23.8     299   3.2     274   11.3    263   14.9    251   18.8
SS     SP       206   12.0     202   1.9     202   1.9     203   1.5     188   8.7
SP     SS       240   31.3     231   3.8     225   6.2     231   3.8     213   11.2
FS     SS       281   93.0     279   0.7     251   10.7    272   3.2     241   14.2
IO     OO       97    67.4     95    2.1     91    6.2     95    2.1     89    8.2

Table 9.27: Total numbers of errors for error types in experiments with the local feature model (AnimLocal, ADPCLocal) compared to the full feature model baseline (NoFeats) and respective counterparts (Anim, ADPC).

The first two rows show errors of the types SS_OO and OO_SS and we find an improvement with local features for both error types in both experiments. A comparison with the results using a full feature model furthermore shows that errors of these types further improve with the use of less local features. We observe the same pattern for other error types involving the confusion of argument relations. For instance, we find a reduction in total numbers of errors for the SS_SP and SP_SS error types in the ADPCLocal experiment, stemming from the addition of the definiteness feature. We also observe a further improvement for both of these with the full feature model. Error types expressing confusion of other argument relations, such as FS_SS and IO_OO, presented in the last rows of table 9.27, further corroborate the general effect.

9.3.3 Features of argument differentiation

It is clear that the addition of argument features has an effect on the analysis of arguments in a highly local setting with no structural context, as in the MSTParser experiments, in a local setting with a structural context, as in the local MaltParser experiments, and clearly also in the experiments with a full feature model in combination with a structural context.
The results of the experiments performed in the current section highlight the relative aspect of argument disambiguation and argument differentiation in general. Arguments are differentiated not only by inherent properties, but also by their properties relative to other arguments. As we discussed in chapter 3, the linguistic dimensions by which arguments tend to differ incur soft, rather than hard, effects. As we know, knowledge that an element is inanimate or indefinite is often, in itself, not sufficient evidence for interpretation as, say, object. Additional knowledge with respect to the linguistic context of the element provides further evidence. For instance, knowledge that an animate, definite nominal has been assigned subject status in preverbal position clearly makes an object relation for the inanimate element more likely.

9.4 Automatically acquired features

A possible objection to the general applicability of the results presented above is that the added information consists of gold standard annotation from a treebank. However, the morphosyntactic features examined here are for the most part straightforwardly derived (definiteness, case, tense, voice) and represent standard output from most part-of-speech taggers. In chapters 6 and 7, we showed that the property of animacy could be fairly robustly acquired for common nouns by means of distributional features from an automatically parsed corpus. In this section we investigate parsing with automatically acquired linguistic features.

Feature         Application
Definiteness    POS-tagger
Case            POS-tagger
Animacy - NN    Animacy classifier
Animacy - PN    Named Entity Tagger
Animacy - PO    Majority class
Tense, voice    POS-tagger

Table 9.28: Overview of applications employed for automatic feature acquisition.
9.4.1 Acquiring the features

The linguistic features may be acquired through the use of different NLP-applications and table 9.28 shows an overview of the applications employed for the automatic acquisition of our linguistic features. For part-of-speech tagging, we chose to employ MaltTagger, an HMM part-of-speech tagger for Swedish (Hall 2003). The pretrained model for Swedish employs the SUC tagset (Gustafson-Capková and Hartmann 2006), exemplified by the part-of-speech tagged version of (182) in (183) below.

(182) Några  har   valts        ut   och  med   dem   skall  man  nu   börja  slutförhandlingen
      some   have  choose-PASS  out  and  with  them  shall  one  now  start  negotiation-DEF
      ‘Some have been chosen and we will now commence negotiations with them’

(183) Example (182); part-of-speech tagged with SUC tagset:
      Några              dt.utr/neu.plu.ind
      har                vb.prs.akt.aux          ⇒ tense, voice
      valts              vb.sup.sfo              ⇒ tense, voice
      ut                 pl
      och                kn
      med                pp
      dem                pn.utr/neu.plu.def.obj  ⇒ case
      skall              vb.prs.akt.mod          ⇒ tense, voice
      man                pn.utr.sin.ind.sub      ⇒ case
      nu                 ab
      börja              vb.inf.akt              ⇒ tense, voice
      slutförhandlingen  nn.utr.sin.def.nom      ⇒ definiteness, case

The SUC part-of-speech tag set distinguishes tense and voice for verbs, nominative and accusative case for pronouns, as well as definiteness and genitive case for nouns. The experiments with the individual verbal features described in section 9.2.6 clearly showed the benefit of mapping the tense values to a binary set of finiteness features and this mapping was performed directly for the acquired features.188

The experiments with features expressing pronoun type, described in section 9.2.4 above, showed that the effect of this feature was largely due to the treatment of non-referential pronouns. Acquisition of non-referentiality is not a trivial task, although it has recently been approached with machine-learning (Boyd, Gegg-Harrison and Byron 2005).
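The derivation of morphosyntactic features from SUC tags, as exemplified in (183), can be sketched as a small mapping function. This is an illustrative sketch, not the procedure actually used: the function name is hypothetical, the tag codes prt, imp and kon (past, imperative, subjunctive) and gen (genitive) are assumed from the SUC tagset since they do not occur in (183), and the mapping of the s-form (sfo) to the passive feature is likewise an assumption. Feature values follow the labels used for the acquired features (DD, GG, AA, FV, PA, and Ø for the null category).

```python
# Tense codes mapped to the finite feature (FV); all other verb forms are
# non-finite (Ø). Only "prs" occurs in example (183); the rest are assumed.
FINITE_CODES = {"prs", "prt", "imp", "kon"}


def features_from_suc(tag):
    """Map a dot-separated SUC tag to a dict of linguistic feature values."""
    parts = tag.lower().split(".")
    pos, morph = parts[0], set(parts[1:])
    feats = {}
    if pos == "vb":
        feats["finiteness"] = "FV" if morph & FINITE_CODES else "Ø"
        # Assumption: the s-form is treated as passive, so deponent verbs
        # would be overgenerated as passives.
        feats["voice"] = "PA" if "sfo" in morph else "Ø"
    elif pos in ("nn", "pn"):
        if pos == "nn":
            # Definiteness is only derived for common nouns.
            feats["definiteness"] = "DD" if "def" in morph else "Ø"
        if "gen" in morph:
            feats["case"] = "GG"
        elif "obj" in morph:
            feats["case"] = "AA"
        else:
            feats["case"] = "Ø"  # nominative is the null category
    return feats


EXAMPLE_183 = {
    "har": "vb.prs.akt.aux",
    "valts": "vb.sup.sfo",
    "dem": "pn.utr/neu.plu.def.obj",
    "man": "pn.utr.sin.ind.sub",
    "nu": "ab",
    "slutförhandlingen": "nn.utr.sin.def.nom",
}
ACQUIRED = {form: features_from_suc(tag) for form, tag in EXAMPLE_183.items()}
```

Tokens whose part of speech carries none of the relevant dimensions, such as the adverb nu, simply receive no features.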
Given the fairly modest impact of the pronoun type feature, however, acquisition of non-referentiality is not pursued further in the present context.

188 Present, past, imperative and subjunctive forms are mapped to the finite feature (FV), all other forms are mapped to the non-finite feature (Ø).

9.4.1.1 Animacy

The feature of animacy is clearly the most challenging feature to acquire automatically. Recall that Talbanken05 distinguishes person reference for all nominal constituents, and as shown in section 7.1.2, 97.8% of the nominal treebank instances annotated as animate are nouns and pronouns. Hence we will in the following focus on automatic animacy annotation for nouns and pronouns.

Common nouns

The animacy classifier developed in chapter 7 classifies common nouns based on their syntactic distribution in the Swedish Parole corpus. Whereas the gold standard classes are employed for training of the classifier, the distributional data is taken from the considerably larger, automatically parsed Parole corpus. The common nouns in Talbanken05 are classified for animacy following a leave-one-out training and testing scheme where each of the n nouns in Talbanken05 is classified with a classifier trained on n−1 instances.189 This ensures that the training and test instances are disjoint at all times. Moreover, the fact that the distributional data is taken from a separate data set ensures non-circularity since we are not basing the classification on gold standard parses.

Proper nouns

In the task of named entity recognition (NER) (Tjong Kim Sang 2002b), proper nouns are classified according to a set of semantic categories (see, e.g., Chinchor et al. 1999). For the annotation of proper nouns, we make use of a named entity tagger for Swedish (Kokkinakis 2004), which is a rule-based tagger based on finite-state rules, supplied with name lists, so-called “gazetteers”.
The tagger distinguishes the category ‘Person’ for human-referring proper nouns and we extract information on this category.

189 We employ the MBLopt classifier described in section 7.3.3.

Pronouns

In section 6.2.3 we extracted information on pronominal reference to nouns based on simple heuristics with respect to a set of pronouns and syntactic position (the ANAAN/ANAIN features). Recall that a subset of the personal pronouns in Scandinavian, as in English, clearly distinguishes their referent with regard to animacy, e.g. han, det ‘he, it’. There is, however, a quite large group of third person plural pronouns which are ambiguous with regard to the animacy of their referent. The ambiguous pronouns include the personal pronouns, e.g., de, dem, deras ‘they, them, theirs’, demonstrative pronouns, e.g., dessa ‘these’, as well as quantifying pronouns like bägge, alla, många ‘both, all, many’.

Dimension        Features    Instances   Correct   Accuracy
Definiteness     DD, Ø       40832       40010     98.0
Case             GG, AA, Ø   68313       67289     98.5
AnimacyNNPNPO    HH, Ø       68313       61295     89.7
AnimacyNN        HH, Ø       40832       37952     92.9
AnimacyPN        HH, Ø       2078        1902      91.5
AnimacyPO        HH, Ø       25403       21441     84.4
Finiteness       FV, Ø       30767       30035     97.6
Voice            PA, Ø       30767       29805     96.9

Table 9.29: Accuracy for automatically acquired linguistic features.

The pronominal part-of-speech tags from the part-of-speech tagger distinguish number and gender and in the animacy classification of the personal pronouns we classify based on these tags only. We employ a simple heuristic where the pronominal tags which had more than 85% human instances in the gold standard are annotated as human.190 This gives us the personal non-neuter pronouns, like vi, oss, han, du, man ‘we, us, he, you-SG, one’, as well as the set of genitive pronouns, like din, min, sina ‘your, mine, theirs’, as animate (HH).191 The pronouns which are ambiguous with respect to animacy are not annotated as animate (Ø).
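The threshold heuristic for pronominal tags can be sketched as follows. The tag strings and counts are invented purely for illustration and do not reflect the actual gold standard distribution; the function names are likewise hypothetical.

```python
def human_tags(counts, threshold=0.85):
    """Select the pronominal tags whose share of human instances in the
    gold standard exceeds the threshold.

    counts maps a tag to a pair (human instances, all instances)."""
    return {
        tag
        for tag, (human, total) in counts.items()
        if total > 0 and human / total > threshold
    }


def annotate(tag, selected):
    """Annotate a pronoun as human (HH) or with the null category (Ø)."""
    return "HH" if tag in selected else "Ø"


# Hypothetical gold standard counts per pronominal tag.
TOY_COUNTS = {
    "pn.utr.sin.def.sub": (950, 1000),     # mostly human referents
    "pn.neu.sin.def.sub/obj": (20, 800),   # mostly non-human referents
    "pn.utr/neu.plu.def.sub": (300, 600),  # ambiguous plural pronouns
}

SELECTED = human_tags(TOY_COUNTS)
```

Because the decision is taken per tag rather than per token, every pronoun carrying an ambiguous tag receives the null category regardless of its referent in context, which favours precision over recall for the animate class.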
190 A manual classification of the individual pronoun lemmas was also considered. However, the treebank has a total of 324 different pronoun forms, hence we opted for a heuristic classification of the part-of-speech tags instead.

191 We manually excluded the third person non-neuter pronoun den ‘it’ from this group of human-referring pronouns.

In table 9.29 we see an overview of the accuracy of the acquired features, i.e., the percentage of correct instances out of all instances. Note that we adhere to the general annotation strategy in Talbanken05, where each dimension (definiteness, case etc.) contains a null category Ø, which expresses the lack of a certain property. Many of the dimensions exhibit quite skewed distributions, hence in table 9.30, we present the class-based measures of precision and recall for each of the non-null features.

Feature   Gold    Automatic   Correct   Precision   Recall
DD        14094   13924       13598     97.7        96.5
GG        3756    3414        3321      97.3        88.4
AA        1745    2180        1707      78.3        97.8
HH        16875   10777       10317     95.7        61.1
HHNN      6010    3538        3334      94.2        55.5
HHPN      1056    920         900       97.8        85.2
HHPO      9809    6319        6083      96.3        62.0
FV        20818   20560       20371     99.1        97.9
PA        2413    3067        2259      74.0        93.6

Table 9.30: Class precision and recall for automatically acquired linguistic features compared to gold standard.

Acquisition of morphological definiteness for common nouns is clearly reliable, with an overall accuracy of 98.0, despite a skewed distribution of classes. Precision and recall for the definite feature (DD) are 97.7 and 96.5, respectively. With respect to case, a property of nouns and pronouns, we find an overall accuracy of 98.5, as table 9.29 shows. However, the genitive and accusative case features are seriously outnumbered by the set of null, or nominative, instances. As table 9.30 shows, acquisition of genitive case shows a somewhat lower recall of 88.4.
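The class-based measures in table 9.30 follow directly from the three counts given for each feature: precision is the share of automatically assigned instances that are correct, and recall is the share of gold standard instances that are recovered. A minimal sketch, using two rows of the table as input (the function name is hypothetical):

```python
def precision_recall(gold, automatic, correct):
    """Return class precision and recall as percentages.

    gold:      instances carrying the feature in the gold standard
    automatic: instances assigned the feature automatically
    correct:   instances where the two annotations agree"""
    precision = 100.0 * correct / automatic
    recall = 100.0 * correct / gold
    return precision, recall


# Counts taken from table 9.30.
DD = precision_recall(gold=14094, automatic=13924, correct=13598)
HH_NN = precision_recall(gold=6010, automatic=3538, correct=3334)
```

A feature that is assigned conservatively, such as animacy for common nouns, shows up as a small value for automatic relative to gold, and hence as high precision paired with low recall.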
For accusative case we observe the opposite situation from the genitive: the part-of-speech tagger overgenerates compared to the gold standard, giving high recall (97.8) but lower precision (78.3).

It is not surprising that we observe the largest discrepancies from the gold standard annotation in the automatic animacy annotation. In general, the annotation of animate nominals exhibits a decent precision (95.7) and a lower recall (61.1). The automatic classification of human common nouns also has a quite high precision (94.2) in combination with a lower recall (55.5). As we noted in chapter 7, this is an advantage given the skewed distribution of the classes in the corpus, since it indicates that the classifier is conservative in terms of class assignment to the minority class. The named-entity recognizer shows more balanced results with a precision of 97.8 and a recall of 85.2 and the heuristic classification of the pronominal part-of-speech tags gives us high precision (96.3) combined with lower recall (62.0) for the animate class.

Just as for the other morphological features, the acquisition of the verbal features of finiteness and voice from the part-of-speech tagger is very reliable, with accuracies of 97.6 and 96.9, respectively. The passive feature is infrequent and shows a quite low precision (74.0) due to syncretism in the s-suffix which is employed for both passives and deponent verbs.

9.4.2 Experiments

The experiments assess the extent to which we may obtain the same effect from the linguistic information with automatically acquired features. This is an important part of assessing the scalability of the results discussed above.
9.4.2.1 Experimental methodology

The experimental methodology is identical to the one described in 9.2.1 above, the only difference being that the linguistic features are acquired automatically, rather than being gold standard. As before, all experiments are performed using 10-fold cross-validation on the written part of Talbanken05 and the feature model is the extended feature model in figure 13. In order to enable a direct comparison with the results from the earlier experiments, we employ the gold standard part-of-speech tags, as before. This means that the set for which the various linguistic features are defined is identical, whereas the feature values may differ.

            Gold standard          Automatic
            Unlabeled   Labeled    Unlabeled   Labeled
NoFeats     89.87       84.92      89.87       84.92
Def         89.87       85.02      89.88       85.03
Case        89.99       85.13      89.95       85.11
Finite      90.24       85.33      90.15       85.23
Voice       89.81       84.97      89.83       85.00
Anim        89.93       85.10      89.86       85.01
AnimNN      89.81       84.94      89.86       84.99
AnimNNPN    89.85       84.98      89.85       84.97
ADC         90.13       85.35      90.01       85.21
ADCV        90.40       85.68      90.27       85.54

Table 9.31: Overall results in experiments with automatic features compared to gold standard features, expressed as unlabeled and labeled attachment scores.

9.4.2.2 Results

Table 9.31 presents the overall results with automatic features, compared to the gold standard results.192 As expected, we find that the effect of the automatic features is generally less pronounced compared to the gold standard counterparts. However, all automatic features improve significantly on the NoFeats baseline. In the error analysis we find the same tendencies in terms of improvement for specific dependency relations and error types.

192 The results for the gold standard combined experiments ADC and ADCV in table 9.31 are somewhat lower than the combined results presented in section 9.2.7, since the former experiments do not include the pronoun type feature.
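The two evaluation measures reported in table 9.31 can be sketched as follows. The unlabeled attachment score is the percentage of tokens that are assigned the correct head, and the labeled attachment score additionally requires the correct dependency label. This is a minimal sketch with a hypothetical function name and an invented four-token example; details of the actual evaluation setup, such as the treatment of punctuation, are not modeled.

```python
def attachment_scores(gold, system):
    """Compute unlabeled and labeled attachment scores in percent.

    gold and system are equally long lists with one (head, label) pair
    per token."""
    if len(gold) != len(system) or not gold:
        raise ValueError("gold and system must be non-empty and equally long")
    n = len(gold)
    unlabeled = sum(g[0] == s[0] for g, s in zip(gold, system))
    labeled = sum(g == s for g, s in zip(gold, system))
    return 100.0 * unlabeled / n, 100.0 * labeled / n


# Invented example: subject and object labels are swapped (a labeling
# error only) and the last token is attached to the wrong head.
GOLD = [(2, "SS"), (0, "ROOT"), (2, "OO"), (3, "ET")]
SYSTEM = [(2, "OO"), (0, "ROOT"), (2, "SS"), (2, "ET")]

UAS, LAS = attachment_scores(GOLD, SYSTEM)
```

The example illustrates why errors such as SS_OO affect only the labeled score, whereas attachment errors such as ET_ET lower both scores.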
Morphological features

The morphological argument features from the POS-tagger are reliable, as we saw above, and we observe almost identical results to the gold standard results. The addition of information on definiteness causes a significant improvement (p<.01), and so does the addition of information on case (p<.0001). The improvement in terms of performance for specific dependency relations is also almost identical, with only small, non-significant variations. As before, the addition of information on definiteness causes the largest effect in terms of performance for the SP relation, as well as improvement for the SS and OO relations. Case information benefits the analysis for determiners and prepositional complements, but also argument relations such as SS, OO, SP and IO. In parallel with the gold standard results, we find that the single feature which has the most notable effect on performance is the feature of finiteness (p<.0001). It influences the analysis of the argument relations, as well as the verbal relations.

Animacy

The addition of the automatically acquired information on animacy shows some interesting results. First of all, the addition of all acquired animacy information for nouns and pronouns (Anim) causes a significant improvement (p<.03), even though it is smaller than in the gold standard experiment. We find that the OO and SS relations are the dependency relations which exhibit the largest improvement. A clear difference from the gold standard experiment, however, resides in the performance for the DT relation, where performance actually deteriorates slightly.193 This is largely due to the set of plural pronouns mentioned above, which are ambiguous with respect to the animacy of their referent. In the gold standard, however, their animacy in the specific context has been manually determined. These pronouns may furthermore function as determiners, in which case they are never annotated as animate.
Consequently, animacy serves as an indicator of clausal as opposed to phrasal argument status which is not provided with the automatic annotation.

193 F-scores for the DT relation go from 94.14 in the baseline to 94.09 in the Anim experiment with automatic features.

We may examine the effect of the different sources of animacy information, i.e. the animacy information supplied for common nouns, proper nouns and pronouns, by examining their effect on parse results. As the results in table 9.31 indicate, it is the information supplied by the animacy classifier for common nouns which largely accounts for the improvement observed with the addition of this feature. This is surprising since the recall for this feature is quite low. The addition of information only for common nouns in the AnimNN experiment causes a significant improvement in overall results (p<.04). In the corresponding gold standard experiment, the results are not significantly better than the baseline and the main, overall, improvement clearly stems from the animacy annotation of pronouns. This indicates that the animacy information for common nouns, which has been automatically acquired from a considerably larger corpus, captures distributional distinctions which are important for the general effect of animacy and furthermore that the differences from the gold standard annotation prove beneficial for the results.

An error analysis shows that the performance of the two parsers with respect to argument relations is very similar and we observe an improved analysis for the SS, OO, SP and IO relations with only minor variations.194 This in itself is remarkable, since the covered set of animate instances is notably smaller in the automatically annotated data set, as shown by table 9.30 above.
We furthermore find that the main difference between the gold standard and automatic AnimNN experiments does not reside in the analysis of arguments, but rather of non-arguments. One relation for which performance deteriorated with the added information in the gold AnimNN experiment is the nominal postmodifier relation (ET) which is employed for relative clauses and nominal PP-attachment. With the automatically assigned feature, in contrast, we observe an improvement in the performance for the ET relation compared to the gold standard experiment, from an F-score of 76.14 with the gold standard feature to 76.40 with the automatic feature. Since this is a quite common relation, with a frequency of 5% in the treebank as a whole, the improvement has a clear effect on the results.

The analysis of postnominal modification is influenced by the differences in the added animacy annotation for the nominal head, as well as the internal dependent. If we examine the corrected errors in the automatic experiment, compared to the gold standard experiment, we find elements with differing annotation. In general, the relation of postnominal modification disprefers attachment to animate nominals. Consider (184)–(185) below which illustrate corrected errors of the types ET_OA and ET_AA, respectively. The nominal heads in these constructions, vän ‘friend’ and kandidater ‘candidates’, are instances which are annotated as animate in the gold standard, but inanimate in the automatically classified data set. The automatic annotation as inanimate results in a correct attachment and labeling of the modifiers, a relative clause in (184) and the head preposition till ‘to’ in (185).

194 The gold standard AnimNN results exhibit slightly better performance for the SS and SP relations, whereas the automatic AnimNN results show slightly better performance for the OO and IO relations.
(184) De  flesta vill trots allt ha   en riktig vän    att hålla ihop     med
      the most   want after all  have a  real   friend to  hold  together with
      ‘Most people, after all, want a real friend to be together with’

(185) För kandidater till landsting  och kommunfullmäktige gäller fortfarande bostadsbandet
      for candidates to   municipals and boards            holds  still       residence-restriction-DEF
      ‘With respect to candidates for municipal boards, the restriction on residence still holds’

We also observe an effect of differing annotation for the nominal dependent in prepositional ET constructions. Preferences with respect to animacy of prepositional complements vary, as we noted in section 7.1.2 and illustrated with table 7.5 on page 137. In (186), the automatic annotation of the noun djur ‘animal’ as animate results in correct assignment of the ET relation to the preposition hos ‘among’, as well as correct nominal, as opposed to verbal, attachment. This preposition, as we recall, is one of the few with a preference for animate complements. In contrast, the example in (187) illustrates an ET_OA error, where the automatic classification of barn ‘children’ as inanimate causes a correct analysis of the head preposition om ‘about’.

(186) . . . mer  permanenta samhällsbildningar hos olika     djur
      . . . more permanent  societies          at  different animals
      ‘. . . more permanent social organizations among different animals’

(187) Föräldrar har  vårdnaden   om sina  barn
      parents   have custody-DEF of their children
      ‘Parents have the custody of their children’

A more thorough analysis of the different factors involved in PP-attachment is a complex task which is clearly beyond the scope of the present study. We may note, however, that the distinctions induced by the animacy classifier based purely on linguistic evidence prove useful for the analysis of both arguments and non-arguments.
Selectional restrictions revisited

In section 9.2.8, we investigated enrichment of the treebank annotation by extension to verbal classes of selectional restrictions centered around the notion of animacy. We concluded that the added information did more harm than good, in spite of modest improvements for argument relations. In particular, we found deterioration in attachment caused by the addition of verbal class information.

We noted earlier that a larger data set might provide more reliable generalizations regarding verbal semantics. We therefore extracted selectional restrictions from the automatically tagged and parsed version of the Parole corpus, where animacy was assigned automatically, as detailed in section 9.4.1 above. The selectional restrictions were otherwise extracted and selectional association was calculated in a manner identical to the one described in section 9.2.8. The only restriction placed on the extraction was a frequency threshold of 10 overall instances in the Parole corpus. Clearly the data employed is considerably noisier, relying on fully automatic annotation. An inspection of the resulting classification shows that the noise in the data influences the selectional associations. We examined the selectional association scores acquired for the verbs skriva ‘write’ and visa ‘show’ which we discussed in section 9.2.8. These indicate that associations with the animate class in general are considerably lower than under treebank acquisition. This is not surprising since we rely on automatic animacy classification with a quite low recall. However, we also know that the classifier has fairly good precision, so we may assume that the quality of these restrictions is reasonable despite the variation. For the subject argument of the verb skriva ‘write’, we find that the association with the animate class is 0.76, compared to 0.95 in the gold standard experiment.
For the parse experiments we set a selectional association threshold of 0.75 in order to take into account the noise in the data. This gives us a very high coverage of the treebank verbs, unlike the previous experiments. With a threshold of 0.75, as many as 91.7% of the verb tokens receive a class. A closer look at the classes shows that several reasonable distinctions are captured, exemplified by (188)–(191) which show Talbanken05 verbs from the different classes with a threshold of 0.75:

(188) AnimSS: lära ‘learn’, berätta ‘tell’, hitta ‘find’, märka ‘notice’, jobba ‘work’
(189) InanSS: gälla ‘concern’, händer ‘happen’, kosta ‘cost’, betyder ‘mean’, minska ‘lessen’
(190) AnimOO: bry ‘bother’, gifta ‘marry’, förlåta ‘forgive’, älska ‘love’, umgås ‘socialize’[195]
(191) InanOO: utföra ‘execute’, göra ‘do’, söka ‘seek’, ge ‘give’, veta ‘know’

The InanOO class is in clear majority and is assigned to 84.8% of the tokens. Inaccuracy in the annotation for the inanimate class along with overgeneration of passives for verbs, as discussed above, is the cause of this overgeneration.

[195] The class of AnimOO includes a group of so-called deponent verbs, characterized by a passive s-suffix, but which have an agentive semantics. Examples include hoppas ‘hope’, trivas ‘enjoy’. These have been part-of-speech tagged as passives, hence their subject has been recorded as an object in terms of selectional restrictions.

              Unlabeled  Labeled
NoFeats       89.87      84.92
Anim          89.86      85.01
SR0.75        89.89      84.91
SR0.75&Anim   89.92      84.95

Table 9.32: Overall results for experiments with selectional restrictions acquired from the Parole corpus with automatically acquired animacy information.

We perform two parse experiments with the acquired selectional restrictions, one with only verbal classes (SR0.75) and one with the acquired animacy information as well (SR0.75&Anim). Table 9.32 shows the results which do not differ significantly from the baseline.
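The threshold-based class assignment described above can be illustrated with a minimal sketch. It assumes that selectional association scores per verb and argument slot are already available (their calculation is described in section 9.2.8 and is not reproduced here). The function name, the dictionary layout and all scores except the 0.76 for the subject of skriva are hypothetical; only that score and the 0.75 threshold come from the text.

```python
# Sketch: assign animacy-based selectional restriction classes to verbs.
# Assumption: scores[verb][slot][animacy] holds a selectional association
# score; this layout is illustrative and not the format used in the thesis.

THRESHOLD = 0.75  # association threshold used in the parse experiments


def assign_classes(scores, threshold=THRESHOLD):
    """Return {verb: [class labels]} for associations at or above threshold."""
    prefix = {"animate": "Anim", "inanimate": "Inan"}
    classes = {}
    for verb, slots in scores.items():
        labels = []
        for slot, by_animacy in sorted(slots.items()):
            for animacy, score in sorted(by_animacy.items()):
                if score >= threshold:
                    labels.append(prefix[animacy] + slot)
        classes[verb] = labels
    return classes


# Only the 0.76 value for the subject of 'skriva' is taken from the text;
# every other number is made up for illustration.
example_scores = {
    "skriva": {"SS": {"animate": 0.76, "inanimate": 0.24},
               "OO": {"animate": 0.10, "inanimate": 0.90}},
    "visa": {"SS": {"animate": 0.60, "inanimate": 0.40}},
}

verb_classes = assign_classes(example_scores)
```

A verb whose scores all fall below the threshold simply receives no class, which mirrors the fact that not every verb token is covered.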
We may note, however, that unlike the gold standard experiments, we observe an improved, rather than deteriorated, attachment accuracy, given by the unlabeled attachment score. This is most likely a result of the increased coverage of classification.

Feature combinations

In parallel with the results achieved with gold standard features, we observe an improvement of overall results compared to the baseline (p<.0001) and each of the individual features when we combine the features of the arguments (ADC; p<.01) and the argument and verbal features (ADCV; p<.0001). Table 9.33 shows the dependency relations which improve the most in the ADCV experiment (left) and the ranked list of argument relations only (right). We may compare the results here with the corresponding information for the gold standard experiment ADPCV presented in table 9.12 above. We find that the ranked lists are nearly identical, but with overall somewhat lower results in the experiment with automatic features. We thus observe the same tendencies with the automatically acquired features. With respect to argument relations, we find improvement for all relations except the FS relation. This difference is clearly due to the fact that our set of automatic features does not include information on referentiality for pronouns.
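The ranking criterion used in table 9.33 can be made concrete with a short sketch. It assumes that the weighted difference is the gain in F-score multiplied by the relative frequency of the relation. This reading is an assumption, although it reproduces the order of the ten most improved relations. The figures are copied from the left half of table 9.33; the function and variable names are illustrative.

```python
# Sketch: rank dependency relations by frequency-weighted F-score gain.
# Assumption: weighted difference = relative frequency * (F_new - F_base).

# (relation, relative frequency, baseline F-score, ADCV F-score),
# deliberately listed in alphabetical rather than ranked order.
ROWS = [
    ("+F", 0.0099, 52.07, 55.27),
    ("AA", 0.0537, 68.70, 69.04),
    ("DT", 0.1081, 94.14, 94.67),
    ("MS", 0.0096, 63.35, 66.06),
    ("OO", 0.0632, 84.53, 86.10),
    ("PA", 0.1043, 94.69, 95.06),
    ("ROOT", 0.0649, 86.71, 87.26),
    ("SP", 0.0297, 84.82, 85.80),
    ("SS", 0.1105, 90.25, 91.32),
    ("VG", 0.0302, 94.65, 96.44),
]


def weighted_gain(row):
    """Frequency-weighted F-score difference for one relation."""
    _, freq, base, new = row
    return freq * (new - base)


def rank_relations(rows):
    """Return relation labels sorted by descending weighted gain."""
    return [row[0] for row in sorted(rows, key=weighted_gain, reverse=True)]


ranking = rank_relations(ROWS)
```

Weighting by frequency explains why a frequent relation such as SS outranks a rare relation with a much larger raw gain.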
       Freq    NoFeats  ADCV          Freq    NoFeats  ADCV
SS     0.1105  90.25    91.32    SS   0.1105  90.25    91.32
OO     0.0632  84.53    86.10    OO   0.0632  84.53    86.10
DT     0.1081  94.14    94.67    SP   0.0297  84.82    85.80
VG     0.0302  94.65    96.44    AG   0.0019  73.56    81.02
PA     0.1043  94.69    95.06    FO   0.0009  56.68    65.38
ROOT   0.0649  86.71    87.26    VO   0.0007  72.10    83.12
+F     0.0099  52.07    55.27    VS   0.0006  58.75    68.75
SP     0.0297  84.82    85.80    ES   0.0050  71.82    72.60
MS     0.0096  63.35    66.06    IO   0.0024  76.14    76.29
AA     0.0537  68.70    69.04    OP   0.0011  27.91    30.77

Table 9.33: 10 most improved dependency relations with automatic ADCV features (left) and improved argument relations with automatic ADCV features (right), ranked by their weighted difference of balanced F-scores.

9.5 Summary of main results

The error analysis presented in chapter 8 revealed consistent errors in syntactic analysis, namely the confusion of argument functions, resulting from word order ambiguity and lack of case marking. In the current chapter, a set of experiments has been reported which examine the effect of various linguistically motivated grammatical features hypothesized to target these errors.

A set of linguistic features were formulated which capture different aspects of argument relations. The features provided approximations of linguistic dimensions shown to be involved in argument differentiation in a range of languages, as well as more language-specific properties of Scandinavian argument realization. An extended feature model enabled us to experiment with the addition of lexical information for arguments through features expressing animacy, definiteness, pronoun type and case. The experiments showed that each feature individually causes a significant improvement in terms of overall labeled accuracy, performance for argument relations, and error reduction for the specific types of errors made by the baseline parser.
Error analysis comparing the baseline parser with new parsers trained with individual features reveals the influence of these features on argument disambiguation. We find that animacy influences the disambiguation of subjects from objects, objects from indirect objects as well as the general distinction of arguments from non-arguments. Definiteness has a notable effect on the disambiguation of subjects and subject predicatives, and pronoun type distinguishes between referential and non-referential subjects. Information on morphological case shows a clear effect in distinguishing between arguments and non-arguments, and in particular, in distinguishing nominal modifiers with genitive case. Experiments with features of the verb included information on tense and voice, and furthermore established the importance of the property of finiteness in parsing of Scandinavian. The final experiments combining all features exhibited a cumulative effect of the linguistic features and also served to validate the choice of these features as important factors in argument disambiguation. The ADPCV experiment which combined information on animacy, definiteness, pronoun type, case and verbal features showed results which differed significantly from the baseline, as well as each of the individual experiments (p<.0001). We found clear improvements for the analysis of all argument relations and clear error reduction in terms of argument disambiguation. For the error types confusing subjects and objects (SS_OO, OO_SS), for instance, we observe a 44.6% and 46.0% error reduction compared to the baseline.

In section 9.2.8, we furthermore enriched the treebank annotation with selectional restrictions, a relational category expressed as a lexical semantic property of the verb which determines the animacy of its arguments.
The study discussed in section 9.2.8 showed the importance of dealing with variation in restrictions and a probabilistic measure of selectional association allowed us to experiment with various levels of gradience for the selectional restriction classes. Experiments testing the effect of selectional restrictions acquired from the treebank, as well as restrictions acquired from an automatically annotated and considerably larger corpus, gave inconclusive results and we found no significant improvements compared to the simple addition of information on animacy. The experiments indicate that information on argument animacy can, and should, be utilized independently of selectional restrictions from the verb.

In section 9.3 we examined the effect of variations over properties of the parser on argument disambiguation. The application of a graph-based, data-driven dependency parser to the same data sets as earlier enabled a contrastive study of argument disambiguation. We observed significant improvements with the added information; however, the error analysis for argument relations highlighted the importance of conditioning on a rich linguistic context. Experiments with a local feature model for MaltParser further elucidated the relative influence of our features in argument disambiguation.

The scalability of the results was addressed in section 9.4. In contrast to the earlier experiments, the linguistic features employed during parsing were acquired automatically. We found that the results may largely be replicated with automatic features and a generic part-of-speech tagger. All added features gave significant improvements over the baseline and the tendencies in terms of error reduction for specific dependency relations were highly similar. We furthermore employed annotation from the animacy classifier developed in chapter 7 during parsing, and in this way externally evaluated the lexical information acquired there.
The application of animacy information based purely on linguistic, distributional data proved to capture important distinctions which gave a performance which was as good as, and even slightly better than, the gold standard counterpart experiment.

10 CONCLUDING REMARKS

In the introduction to this thesis we set out to study linguistic factors involved in argument differentiation and made the initial assumption that these may be studied using data-driven methods which generalize over language data. A unifying theme in the work presented here has been the induction of linguistic generalizations from differential properties of syntactic arguments in language data. Consistent correlations between the morphosyntactic and semantic realization of arguments have been exploited in the lexical acquisition of animacy, which was the topic of chapters 6–7, and in argument disambiguation in syntactic parsing, which was the focus in chapters 8–9.

In this final chapter, we conclude the thesis by outlining its main contributions and directions for future work. We will in particular describe the main findings which unite the thesis, as well as more specific contributions internal to its two main parts.

10.1 Main contributions

The underlying methodological conviction expressed throughout this thesis has been an empiricist one, focusing on the essential role of language data in linguistic investigations. We have shown how data-driven, computational models of language can be employed for linguistic investigations and in turn how linguistic generalizations can improve on computational models.

The main contributions of the thesis are found in its attempt to unify insights from different subfields of linguistics, in particular theoretical and computational approaches.
We have seen how the study of soft constraints and gradience in language can be carried out using data-driven models and have argued that these provide a controlled setting where different factors may be evaluated and their influence quantified. By focusing on empirical evaluation, we have come to a better understanding of the results and implications of data-driven models and we have shown how linguistic motivation in turn can lead to improved computational models. Data-driven models clearly benefit from linguistically informed feature selection and error analysis.

10.1.1 Lexical acquisition

The initial assumption made at the beginning of Part II is that there is a close relation between the syntactic distribution of nouns and their semantic properties; so close, in fact, that we may approach the latter by generalizing over the former. A corpus study of the distribution of human and inanimate nominal elements, detailed in section 7.1.2, confirms the central role of animacy in argument differentiation and shows significant distributional differences between the distinctions in argumenthood established in chapter 3.1: subject and object, core and non-core arguments, as well as argument and non-argument. We approach the task of animacy classification for nouns through data-driven, lexical acquisition based on morphosyntactic distributional data which capture exactly the tendencies in argument differentiation discussed above.

The task of animacy classification is not a widely studied one in computational linguistics, although it resembles other semantic classification tasks like named-entity recognition or verb classification. A main contribution of Part II is thus found in the definition of the classification task and the identification of several factors central to its performance.
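To make the notion of morphosyntactic distributional data concrete, the following sketch turns raw counts of a noun's occurrences into relative-frequency features. The feature names SUBJ, OBJ and GEN are among those named in this chapter; the counts, the example noun bok ‘book’ and the function name are invented for illustration, and the sketch does not reproduce the full feature set or the learners used in the thesis.

```python
# Sketch: relative-frequency features over a noun's syntactic distribution.
# All counts below are invented; only the feature names come from the text.

FEATURES = ("SUBJ", "OBJ", "GEN")


def distribution_features(counts, total):
    """Map raw counts per feature to relative frequencies for one noun."""
    if total <= 0:
        raise ValueError("noun must occur at least once")
    return {name: counts.get(name, 0) / total for name in FEATURES}


# Hypothetical counts: how often each noun occurs as subject, as object,
# or in genitive form, paired with its total corpus frequency.
noun_counts = {
    "vän": ({"SUBJ": 40, "OBJ": 15, "GEN": 10}, 100),
    "bok": ({"SUBJ": 10, "OBJ": 45}, 100),
}

vectors = {
    noun: distribution_features(counts, total)
    for noun, (counts, total) in noun_counts.items()
}
```

A vector of this kind abstracts away from individual sentences, which is what allows a classifier to generalize to nouns it has not seen with an animacy label.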
We have shown that classification performance is influenced by several factors, such as data representation and sparsity, the size of the data sets and their class distribution. Obtaining animacy data is another topic which has been dealt with extensively. We have reviewed and evaluated annotation schemes for animacy and addressed the inventory of classes through empirical investigations into the dimension of animacy and its gradience.

In chapters 6–7 we identified several factors which individually and in combination influence the classification results. We varied the feature representation of the nouns, from a small set of theoretically motivated features to a more general feature space. An accuracy of 95% was obtained on a small set of high frequency nouns with only seven morphosyntactic features, and we ascertained that backing off to a smaller set of the three most frequent features allowed us to maintain similar performance for nouns with considerably lower frequencies (∼50). The scaling of the classification task to a larger data set extracted from an annotated treebank highlighted the skewed distribution of the classes of animate and inanimate, showing an approximate 10-90 split in the data. An important part of dealing with the skewed class distribution was found in extending the feature space to include more information on each individual noun. We also saw how the size of the data set influenced the performance of the classifier, notwithstanding the influence of data sparsity. We obtain results for animacy classification, ranging from 97.3% accuracy to 94.0% depending on the sparsity of the data. With an absolute frequency threshold of 10, we obtain an accuracy of 95.4%, which constitutes a 50% reduction of error rate.
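The relative error reduction reported here can be computed as in the following sketch. The baseline accuracy of 90.8% is not stated in this chapter; it is an assumed value, chosen because it is consistent with the reported 50% reduction and with the approximate 10-90 class split. The function name is illustrative.

```python
# Sketch: relative error reduction between a baseline and a new accuracy.


def error_reduction(baseline_acc, new_acc):
    """Both accuracies are percentages; returns the relative reduction (0-1)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    if baseline_err <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - new_err) / baseline_err


# 95.4 is the accuracy reported in the text; 90.8 is an assumed baseline.
reduction = error_reduction(90.8, 95.4)
```

The same function applies to the error-type counts used in the parsing chapters, provided that errors are first expressed as rates.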
The classifier is conservative with respect to the minority class of animate instances and, with a frequency threshold of 1, it exhibits a precision of 79.1 and a recall of 40.5 for the animate class. The corresponding results for the majority class of inanimate elements are 94.6 and 99.0.

We initially defined the classification task as a binary one with the classes of ‘animate’ and ‘inanimate’. Gradience in the animacy dimension was established through experiments with varying class granularity as well as a comparison of human and automatic annotation for the dimension of animacy. In section 6.6, we included a set of collective nouns denoting organizations in our data set. Results from a three-way classification experiment showed that these constitute a distinct group based on linguistic distribution due to their potential for both highly agentive, as well as mass-like readings. Clustering experiments with the same data set clearly support a main distinction between animate and inanimate entities, with a gradience of the animate category which extends to the aforementioned group of organization nouns. A comparison of nouns which show gradient properties in the human annotation for animacy and in the automatic classification underlines the fact that animacy classification strictly deals with animacy as a linguistic category. We find that 53.1% of inanimate entities which are misclassified as animate by the automatic classifier are elements of gradient animacy, such as animals, collective nouns and abstract or vague nouns. The treatment of animacy as a denotational property based only on linguistic evidence has led to a consistent annotation which captures relevant information in the task of argument disambiguation.

Distributional features and a denotational treatment of animacy were argued to constitute prerequisites for acquisition of lexical preferences based on soft, probabilistic constraints on arguments.
We have seen that proposed distinctions relevant to the animacy dimension may be explored employing machine learning. The extensive feature analysis performed throughout Part II has clearly shown the acquisition of these functional, distributional preferences. We conclude that statistical tendencies in argument differentiation with respect to the dimension of animacy support automatic classification of unseen nouns, and that this classification is robust, generalizable across machine learning algorithms – both supervised and unsupervised – as well as scalable to larger data sets.

10.1.2 Parsing

Part III of this thesis was devoted to the study of argument disambiguation in data-driven syntactic parsing. The main goal of this part of the thesis was to investigate the influence of various linguistic features on argument disambiguation. We motivated the choice of a data-driven parser by the direct relationship to frequency in language use which alleviates the need for explicit formulation of constraints on arguments. Moreover, a dependency representation enables acquisition of generalizations at the level of grammatical functions, abstracting away from specific, structural realizations, whilst limiting the structural assumptions to the minimal syntactic relation between a head and its dependent. For our experiments we employed MaltParser, a language-independent system for data-driven dependency parsing.

In order to enable a detailed study of the influence of different features, we developed an explicit methodology for error analysis of the results, whereby we manipulate sets of errors and in this way quantify improvement and deterioration of results. The results from a baseline parser were analyzed in chapter 8, where we employed only a limited set of features to represent tokens in the parse configuration: part-of-speech (POS), lexical form (FORM) and previously assigned dependency relations (DEPREL).
We noted the acquisition of a range of generalizations regarding syntactic arguments based only on the distribution of these features in the data, such as the canonical ordering of arguments, categorial and lexical preferences with respect to argument realization and a reasonably good distinction of arguments from non-arguments. The error analysis also revealed consistent errors in argument assignment, and we determined that properties common to Scandinavian type languages, namely word order variation combined with little morphological marking, were largely responsible for these errors.

Following the initial error analysis presented in chapter 8, we performed a set of experiments with an extended feature model and linguistically motivated features. The features of animacy, definiteness and referentiality were motivated by linguistic studies employing typological, theoretical and psycholinguistic data and found to be important in argument differentiation, as presented in chapter 3. Furthermore, features representing case, tense and voice were features which approximated defining properties of argument realization in Scandinavian type languages, as presented in chapter 4. Each feature individually caused a significant improvement in terms of overall labeled accuracy, performance for argument relations, and error reduction for the specific types of errors made by the baseline parser. We furthermore established that the replacement of verbal tense with the property of finiteness significantly improved the effect of verbal features. We also achieved a cumulative effect in the combination of the features which differed significantly from the baseline, as well as each of the individual experiments (p<.0001). Moreover, resulting error analyses revealed the acquisition of functional preferences for a range of argument relations and linguistic features in line with the observations in chapter 3.
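The difference between the baseline and the extended feature model can be sketched as a token representation. This is a schematic illustration only: the class, the field names, the feature names ANIM and DEF and the example values are hypothetical and do not reflect MaltParser's actual feature specification format.

```python
# Sketch: baseline versus extended token features for a parser configuration.
# Hypothetical representation; not MaltParser's feature specification format.
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                 # lexical form (FORM)
    pos: str                  # part-of-speech (POS)
    deprel: str = ""          # previously assigned dependency relation (DEPREL)
    feats: dict = field(default_factory=dict)  # added linguistic features


def token_features(token, prefix, extended=False):
    """Return a flat feature dict for one token in the configuration."""
    features = {
        prefix + ".FORM": token.form,
        prefix + ".POS": token.pos,
        prefix + ".DEPREL": token.deprel,
    }
    if extended:
        # The extended model adds properties such as animacy or definiteness.
        for name, value in sorted(token.feats.items()):
            features[prefix + "." + name] = value
    return features


# Illustrative token: the noun 'vän' with made-up tag and feature values.
friend = Token("vän", "NN", feats={"ANIM": "animate", "DEF": "indefinite"})
baseline = token_features(friend, "next")
extended = token_features(friend, "next", extended=True)
```

Keeping the added properties in a separate dictionary makes it easy to switch individual features on and off, which is the kind of controlled comparison the experiments rely on.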
Comparative experiments on identical data sets with a graph-based dependency parser, MSTParser, gave significant improvements in results also here. Moreover, the results highlighted the importance of conditioning on the previous linguistic context for improved argument disambiguation, and led us to experiment with feature locality in MaltParser. With a highly local feature model, where additional linguistic features were limited to candidate head and dependent during parsing, we found that our features gave clear improvements, but lower effects in terms of argument disambiguation.

The features employed initially were gold standard features taken from the treebank annotation. The scalability of the results achieved with the gold standard annotation was addressed and largely confirmed in section 9.4. Similar, although slightly lower, results in terms of parse performance were achieved with a set of automatically acquired features taken largely from a generic part-of-speech tagger. We applied the animacy classifier developed in chapter 7 and found that it captured linguistic distinctions which proved important for the disambiguation of arguments. The addition of automatically acquired animacy information for common nouns resulted in a significant improvement of overall parse results and was shown to give as good, or even slightly better, results than the gold standard counterpart. Error analysis revealed the influence of gradient animacy categories.

10.1.3 Argument differentiation

The ability to distinguish between different types of arguments is central to syntactic analysis, and the way in which this is done is dependent on a range of interacting factors. In this thesis, we have approached the topic of argument differentiation by establishing a set of argument distinctions and a set of linguistic dimensions which we hypothesize to be correlated.
We have furthermore argued that data-driven models can provide an elucidating perspective on the coupling of arguments and linguistic properties without an explicit expression of a set of constraints. Generalization over language data has shown consistent statistical tendencies in argument realization and we have seen how these may be employed to acquire linguistic categories. In the initial chapter we posed two main research questions, repeated below:

1. How are syntactic arguments differentiated?
   • Which linguistic properties differentiate arguments?
   • How do linguistic properties interact to differentiate an argument?
2. How may we capture argument differentiation in data-driven models of language? What are the effects?

More than anything, we hope to have shown in this thesis that these are interesting and worthwhile questions. We have also provided some possible answers which highlight different aspects of argument differentiation.

First of all, we may conclude that differentiation of arguments takes place relative to several distinctions in argumenthood, such as the distinction between the two arguments of a transitive construction, the subject and object, the core and non-core arguments, as well as the main distinction between arguments and non-arguments. Our results emphasize this relative nature of argument differentiation. The most important distributional features employed for animacy classification, such as the SUBJ, OBJ and GEN features, provide information regarding important distinctions in argumenthood, hence environments which tend to exhibit differential properties. The parse experiments highlight this point further, and we find that error reduction in terms of argument disambiguation increases in line with conditioning on the linguistic context in terms of grammatical relations and linguistic features.
The formulation of question 1, which asks how arguments are differentiated, furthermore points to part of the answer, namely: through a set of linguistic properties. In this thesis we have identified several such properties, and, perhaps more importantly, attempted to explicate and evaluate the conditions under which these properties affect syntactic argumenthood. We have examined linguistic properties, such as animacy, definiteness and referentiality, which are relevant to a range of languages, as well as more language-specific properties relating to morphological and structural properties, such as case and finiteness. With respect to levels of linguistic analysis these properties represent a mixture, ranging from semantic and discourse-oriented properties to morphosyntactic ones. We propose that the interaction of the linguistic properties is probabilistic and that the frequency distribution of linguistic properties relative to different distinctions in argumenthood in language data directly determines their importance in argument differentiation. These results are clearly compatible with a view of argumenthood as determined by a set of soft constraints, as suggested by theoretical and psycholinguistic work.

As mentioned above, one of the main contributions of this thesis is methodological and a large portion of this thesis has therefore been concerned with the second question posed initially, i.e., how argument differentiation may be captured in data-driven models. In the introduction, we proposed to employ theoretical proposals regarding argument differentiation to motivate the definition of data-driven learning problems and thereby to guide generalization from language data. In order to capture argument differentiation we have formulated features which generalize over syntactic arguments as well as their linguistic properties.
In general, we have seen that a separation of a notion of argumenthood from specific morphosyntactic realization has been a key component in the data-driven modeling of argument differentiation. This was achieved through distributional features in Part II, which generalize over specific structural realization, such as word order. This representation was shown to capture statistical tendencies in argumenthood with respect to the linguistic property of animacy. In Part III, we studied argument differentiation in context through the study of argument disambiguation in a data-driven dependency parser for Swedish. The error analysis showed that further improvement of argument analysis was partly dependent on properties of argument realization other than word order and morphology. The separation of functional arguments from structural position which characterizes dependency analysis enabled the acquisition of functional generalizations irrespective of structural realization. For Scandinavian type languages, which are characterized by considerable word order variation and lack of morphological marking, the separation of function from structural realization constitutes an important property. The acquisition of soft, functional constraints is furthermore clear from the type of improvement which the added information incurred. We found improvement largely in labeled results caused by disambiguation of grammatical functions, rather than structural positions (attachment). For instance, for the errors confusing subjects for objects and vice versa, which were largely errors in labeling, we observed an error reduction of 44–46% in the experiments combining all features. We found that a majority of the improved errors were arguments which were non-canonical in some sense, i.e., departing from the most frequent structural and morphological properties.
Improvement thus relied on other properties of argument relations and the abstraction over specific realization in terms of dependency relations.

The representation of the linguistic properties introduced in chapters 3–4 has also constituted an important part of the data-driven modeling of argument differentiation. We have throughout the thesis explicitly stated that data-driven modeling relies largely on approximation. The formulation and evaluation of the features proposed to approximate linguistic properties of arguments has therefore constituted a central part of the work described above.

10.2 Future work

A natural next step is to extend the studies performed here to other languages. The linguistic tendencies in argument differentiation presented in chapter 3 have been attested in a range of different languages. The work on acquisition of animacy described in Part II is based on linguistic generalizations relevant to a wide range of languages, for instance English (Zaenen et al. 2004). It would therefore be interesting to experiment with animacy classification for other languages. Properties of the Scandinavian languages which may be connected with errors in argument assignment, see Part III, turn out not to be isolated phenomena. A range of other languages exhibit similar properties: Italian exhibits word order variation, little case, syncretism in agreement morphology, as well as pro-drop; German exhibits a larger degree of word order variation in combination with quite a bit of syncretism in case morphology; Dutch has word order variation, little case and syncretism in agreement morphology. These are all examples of other languages for which the results described here are relevant. Work on subject-object disambiguation for Italian suggests that a very similar approach might be worth pursuing also for this language (Dell’Orletta et al. 2005, 2006).
The cross-linguistic generalizability of our results can be tested empirically by data-driven dependency parsing of other languages with linguistically motivated features.

Gradience in the animacy dimension was here addressed through experiments with a more fine-grained set of classes. However, one might also draw a more radical conclusion from the notion of gradience and dispense with discrete categories altogether, opting for a continuous animacy dimension. As a first step, application of a soft clustering algorithm might provide an exploratory overview. We have furthermore treated animacy as a lexical property of nouns and adopted a denotational treatment of this property through type-level classification. This has been shown to be a viable approach through the application of the animacy classifier during parsing. However, we have also discussed, in several places, the influence of the linguistic context on referential properties of nominal elements. An interesting possibility is thus to perform token-level animacy classification, in which the type-level classification serves as a kind of prior probability (Brew and Lapata 2004) and the specific linguistic context is incorporated in addition.

In the automatic acquisition of features for parsing, detailed in section 9.4, we did not deal with the acquisition of non-referential subjects. As we have noted already, there is a range of quite different constructions which include non-referential subjects, and the extent to which this is dependent on the argument structure of the verb also varies. This makes classification of non-referential subjects a challenging but also interesting task, since it provides a more fine-grained picture of argument properties.

Finally, the work described in this thesis has been largely concerned with syntactic arguments, which are by definition subcategorized for by the verb. We have also discussed differentiation of arguments from non-arguments.
The error analysis for the baseline parser showed that confusion of different kinds of adverbials is a frequent error type in the parsing of Swedish, in addition to confusion of arguments. The case of adverbial disambiguation constitutes an area with several commonalities to that of argument disambiguation. First of all, adverbial placement is characterized by variation, in particular following the finite verb in Swedish (Andréasson 2007). Hence, an approach which operates with a separate level of functional analysis, like dependency grammar, may capture regularities irrespective of structural realization, much as in the case of argument disambiguation.

REFERENCES

Aarts, Bas 2004. Conceptions of gradience in the history of linguistics. Language Sciences 26 (4): 343–389.
Abeillé, Anne (ed.) 2003. Treebanks: Building and using parsed corpora. Dordrecht: Kluwer Academic Publishers.
Ahrenberg, Lars 1990. A grammar combining phrase structure and field structure. Proceedings of COLING-90, Volume 2, 1–6.
Aissen, Judith 1999. Markedness and Subject choice in Optimality Theory. Natural Language and Linguistic Theory 17 (4): 673–711.
Aissen, Judith 2003. Differential Object Marking: Iconicity vs. economy. Natural Language and Linguistic Theory 21 (3): 435–483.
Andréasson, Maia 2007. Satsadverbial, ledföljd och informationsdynamik i svenskan. Göteborgsstudier i Nordisk Språkvetenskap. Göteborg University.
Anttila, Arto 1997. Deriving variation from grammar: A study of Finnish genitives. Frans Hinskens, Roeland van Hout and W. Leo Wetzels (eds), Variation, change and phonological theory, 35–68. Amsterdam: John Benjamins.
Baker, Mark 1983. Objects, themes and lexical rules in Italian. Lori Levin (ed.), Papers in Lexical-Functional Grammar, 1–45. Indiana Univ. Ling. Club.
Baker, Mark 1997. Thematic roles and syntactic structure. Liliane Haegeman (ed.), Elements of grammar: Handbook of generative syntax, 73–137.
Dordrecht: Kluwer Academic Publishers.
Baldwin, Timothy 2006. Data-driven methods for acquiring lexical semantics. Lecture notes from the ESSLLI 2006 Course on Data-Driven Methods for Acquiring Linguistic Information, Málaga, Spain.
Bard, Ellen Gurman, Dan Robertson and Antonella Sorace 1996. Magnitude Estimation of linguistic acceptability. Language 72 (1): 32–68.
Barlow, Michael and Suzanne Kemmer (eds) 2000. Usage based models of language. Stanford, CA: CSLI Publications.
Baroni, Marco 2007. Distributions in text. Anke Lüdeling and Merja Kytö (eds), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.
Beaver, David I. and Hanjung Lee 2004. Input-output mismatches in OT. Reinhard Blutner and Henk Zeevat (eds), Optimality theory and pragmatics, 112–153. Houndmills, Basingstoke, Hampshire: Palgrave/Macmillan.
Bikel, Daniel M. 2004. Intricacies of Collins’ parsing model. Computational Linguistics 30 (4): 479–511.
Blaheta, Don and Eugene Charniak 2000. Assigning function tags to parsed text. Proceedings of the North American chapter of the Association for Computational Linguistics (NAACL), 234–240.
Bock, J. Kathryn and Richard K. Warren 1985. Conceptual accessibility and syntactic structure in sentence formulation. Cognition 21 (1): 47–67.
Bod, Rens 1995. Enriching linguistics with statistics: Performance models of natural language. Ph.D. diss., University of Amsterdam.
Bod, Rens 1998. Beyond grammar: An experience-based theory of language. Stanford, CA: CSLI Publications.
Bod, Rens, Jennifer Hay and Stefanie Jannedy 2003. Introduction. Rens Bod, Jennifer Hay and Stefanie Jannedy (eds), Probabilistic linguistics, 289–341. Cambridge, MA: MIT Press.
Boersma, Paul and Bruce Hayes 2001. Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry 32 (1): 45–86.
Boleda, Gemma 2007. Automatic acquisition of semantic classes for adjectives. Ph.D. diss., Pompeu Fabra University.
Boleda, Gemma, Toni Badia and Eloi Batlle 2004. Acquisition of semantic classes for adjectives from distributional evidence. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), 1119–1125.
Börjars, Kersti 1998. Feature distribution in Swedish noun phrases. Oxford: Blackwell Publishers.
Börjars, Kersti, Elisabet Engdahl and Maia Andréasson 2003. Subject and object positions in Swedish. Miriam Butt and Tracy Holloway King (eds), Proceedings of the LFG03 conference. Stanford, CA: CSLI Publications.
Bouma, Gerlof 2008. Starting a sentence in Dutch: A corpus study of subject- and object-fronting. Ph.D. diss., Groningen University.
Boyd, Adriane, Whitney Gegg-Harrison and Donna Byron 2005. Identifying non-referential it: A machine learning approach incorporating linguistically motivated features. Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in NLP, 40–47.
Branigan, Holly P., Martin J. Pickering and Mikihiro Tanaka 2008. Contributions of animacy to grammatical function assignment and word order production. Lingua 118 (2): 172–189.
Bresnan, Joan 2001. Lexical-Functional Syntax. Malden, Mass.: Blackwell Publishers.
Bresnan, Joan 2006. Is syntactic knowledge probabilistic? Experiments with the English dative alternation. Sam Featherston and Wolfgang Sternefeld (eds), Roots: Linguistics in search of its evidential base. Berlin: Mouton de Gruyter.
Bresnan, Joan, Anna Cueni, Tatiana Nikitina and Harald Baayen 2005. Predicting the dative alternation. Gosse Bouma, Irene Kraemer and Joost Zwarts (eds), Cognitive foundations of interpretation, 69–94. Amsterdam: Royal Netherlands Academy of Science.
Bresnan, Joan, Shipra Dingare and Christopher D. Manning 2001. Soft constraints mirror hard constraints: Voice and person in English and Lummi. Miriam Butt and Tracy Holloway King (eds), Proceedings of the LFG01 conference. Stanford, CA: CSLI Publications.
Bresnan, Joan and Jonni M.
Kanerva 1989. Locative inversion in Chicheŵa: A case study of factorization in grammar. Linguistic Inquiry 20 (1): 1–50.
Bresnan, Joan and Tatiana Nikitina 2007. The gradience of the dative alternation. Linda Uyechi and Lian Hee Wee (eds), Reality exploration and discovery: Pattern interaction in language and life. Stanford, CA: CSLI Publications.
Brew, Chris and Mirella Lapata 2004. Verb class disambiguation using informative priors. Computational Linguistics 30 (1): 45–73.
Bröker, Norbert 1998. How to define a context-free backbone for DGs: Implementing a DG in the LFG formalism. Proceedings of the COLING-ACL workshop on Processing of Dependency Grammars, 29–38.
Buchholz, Sabine 2002. Memory-based grammatical relation finding. Ph.D. diss., Tilburg University.
Buchholz, Sabine and Erwin Marsi 2006. CoNLL-X shared task on multilingual dependency parsing. Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 149–164.
Buch-Kromann, Matthias 2006. Discontinuous Grammar: A model of human parsing and language acquisition. Ph.D. diss., Copenhagen Business School.
Bybee, Joan and Paul Hopper (eds) 2001. Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins.
Carlson, Greg 1980. Reference to kinds in English. New York: Garland Press.
Carreras, Xavier and Lluís Màrquez 2005. Introduction to the CoNLL-2005 shared task: Semantic Role Labeling. Proceedings of CoNLL-2005, 89–97.
Carroll, Glenn and Mats Rooth 1998. Valence induction with a head-lexicalized PCFG. Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP), 36–45.
Carroll, John 2000. Statistical parsing. Robert Dale, Hermann Moisl and Harold Somers (eds), Handbook of natural language processing, 525–543. New York/Basel: Marcel Dekker.
Carroll, John and Edward Briscoe 2002. High precision extraction of grammatical relations.
Proceedings of the 19th International Conference on Computational Linguistics (COLING), 134–140.
Chang, Chih-Chung and Chih-Jen Lin 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Charniak, Eugene 1996. Treebank grammars. Proceedings of the 13th national conference on Artificial Intelligence (AAAI), 1031–1036.
Charniak, Eugene 1997. Statistical parsing with a context-free grammar and word statistics. Proceedings of the 14th national conference on Artificial Intelligence (AAAI), 598–603.
Charniak, Eugene 2000. A maximum-entropy-inspired parser. Proceedings of the North American chapter of the Association for Computational Linguistics (NAACL), 132–139.
Charniak, Eugene and Mark Johnson 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL), 173–180.
Chinchor, Nancy, Erica Brown, Liza Ferro and Patty Robinson 1999. 1999 Named Entity Recognition Task Definition. MITRE and SAIC. Version 1.4.
Choi, Hye-Won 2001. Phrase structure, information structure, and resolution of mismatch. Peter Sells (ed.), Formal and empirical issues in Optimality Theoretic syntax, 17–62. Stanford, CA: CSLI Publications.
Chomsky, Noam 1965. Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, Noam 1975. The logical structure of linguistic theory. New York: Springer.
Chomsky, Noam 1981. Lectures on government and binding. Holland: Foris Publications.
Chomsky, Noam 1995. The minimalist program. Cambridge, MA: MIT Press.
Chrupała, Grzegorz and Josef van Genabith 2006. Using machine-learning to assign function labels to parser output for Spanish. Proceedings of the COLING/ACL main conference poster session, 136–143.
Collins, Michael 1996. A new statistical parser based on bigram lexical dependencies.
Proceedings of the 34th annual meeting of the Association for Computational Linguistics (ACL), 184–191.
Collins, Michael 1999. Head-driven statistical models for natural language parsing. Ph.D. diss., University of Pennsylvania.
Comrie, Bernard 1989. Language universals and linguistic typology. Chicago, IL: University of Chicago Press.
Covington, Michael A. 2001. A fundamental algorithm for dependency parsing. Proceedings of the 39th annual ACM southeast conference, 95–102.
Croft, William 1990. Typology and universals. Cambridge: Cambridge University Press.
Croft, William 2003. Typology and universals. 2nd edition. Cambridge: Cambridge University Press.
Crystal, David 1967. English. Lingua 17 (1): 24–56.
Daelemans, Walter 1999. Memory-based language processing. Journal for Experimental and Theoretical Artificial Intelligence 11 (3): 287–467.
Daelemans, Walter and Antal van den Bosch 2005. Memory-based language processing. Cambridge: Cambridge University Press.
Daelemans, Walter, Antal van den Bosch and Jakub Zavrel 1999. Forgetting exceptions is harmful in language learning. Machine Learning 34 (1): 11–43.
Daelemans, Walter, Jakub Zavrel, Ko Van der Sloot and Antal Van den Bosch 2004. TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. Technical Report, ILK Technical Report Series 04-02.
Dahl, Östen 2000. Egophoricity in discourse and syntax. Functions of Language 7 (1): 37–77.
Dahl, Östen 2008. Animacy and egophoricity: Grammar, ontology and phylogeny. Lingua 118 (2): 141–150.
Dahl, Östen and Kari Fraurud 1996. Animacy in grammar and discourse. Thorstein Fretheim and Jeanette K. Gundel (eds), Reference and referent accessibility, 47–65. Amsterdam: John Benjamins.
Dalrymple, Mary 2001. Lexical Functional Grammar. New York: Academic Press.
Dell’Orletta, Felice, Alessandro Lenci, Simonetta Montemagni and Vito Pirrelli 2005. Climbing the path to grammar: A maximum entropy model of subject/object learning.
Proceedings of the 2nd Workshop on Psychocomputational Models of Human Language Acquisition, 72–81.
Dell’Orletta, Felice, Alessandro Lenci, Simonetta Montemagni and Vito Pirrelli 2006. Probing the space of grammatical variation: Induction of cross-lingual grammatical constraints from treebanks. Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora, 21–28.
Diderichsen, Paul 1957. Elementær dansk grammatik. København: Gyldendal.
Dietterich, Thomas G. 1998. Approximate statistical test for comparing supervised classification learning algorithms. Neural Computation 10 (7): 1895–1923.
Dik, Simon C. 1989. The theory of functional grammar. Dordrecht: Foris.
Dowty, David 1982. Grammatical relations and Montague grammar. Pauline Jacobson and Geoffrey K. Pullum (eds), The nature of syntactic representations, 79–130. Dordrecht: Reidel.
Dowty, David 1991. Thematic proto-roles and argument selection. Language 67 (3): 547–619.
Dryer, Matthew S. 1986. Primary objects, secondary objects and antidative. Language 62 (4): 808–845.
Eide, Kristin Mehlum 2008. Finiteness and inflection: The syntax your morphology can afford. Downloaded from http://ling.auf.net/lingBuzz on Dec 10, 2008.
Einarsson, Jan 1976a. Talbankens skriftspråkskonkordans. Dept. of Scandinavian languages, Lund University.
Einarsson, Jan 1976b. Talbankens talspråkskonkordans. Dept. of Scandinavian languages, Lund University.
Engdahl, Elisabet, Maia Andréasson and Kersti Börjars 2004. Word order in the Swedish midfield – an OT approach. Fred Karlsson (ed.), Proceedings of the 20th Scandinavian Conference of Linguistics. http://www.ling.helsinki.fi/kielitiede/20scl/proceedings.shtml.
Erk, Katrin 2007. A simple, similarity-based model for selectional preferences. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), 216–223.
Faarlund, Jan Terje, Svein Lie and Kjell Ivar Vannebo 1997. Norsk referansegrammatikk.
Oslo: Universitetsforlaget.
Falk, Cecilia 1993. Non-referential subjects in the history of Swedish. Ph.D. diss., Dept. of Scandinavian languages, Lund University.
Fass, Dan 1988. Metonymy and metaphor: What’s the difference? Proceedings of the 12th International Conference on Computational Linguistics (COLING), 177–181.
Fellbaum, Christiane (ed.) 1998. Wordnet: an electronic lexical database. Cambridge, MA: MIT Press.
Fillmore, Charles J. 1968. The case for case. Emmon Bach and Robert Thomas Harms (eds), Universals in linguistic theory, 1–88. New York: Holt, Rinehart and Winston.
Fraurud, Kari 1996. Cognitive ontology and NP form. Thorstein Fretheim and Jeanette K. Gundel (eds), Reference and referent accessibility, 65–88. Amsterdam: John Benjamins.
Frazier, Lyn 1985. Syntactic complexity. David Dowty, Lauri Karttunen and Arnold M. Zwicky (eds), Natural language parsing, 129–189. Cambridge: Cambridge University Press.
Garretson, Gregory, M. Catherine O’Connor, Barbora Skarabela and Marjorie Hogan 2004. Optimal typology of determiner phrases coding manual. Version 3.2. Boston University. Downloaded from http://people.bu.edu/depot/coding_manual.html on 02/15/2006.
Givón, Talmy 1984. Syntax: A functional-typological introduction. Amsterdam: John Benjamins.
Goldwater, Sharon and Mark Johnson 2003. Learning OT constraint rankings using a Maximum Entropy model. Jennifer Spenader, Anders Eriksson and Östen Dahl (eds), Proceedings of the Stockholm Workshop on Variation within Optimality Theory, 111–120.
Grimshaw, Jane 1990. Argument structure. Cambridge, MA: MIT Press.
Gundel, Jeanette K., Nancy Hedberg and Ron Zacharski 1993. Cognitive status and the form of referring expressions. Language 69 (2): 274–307.
Gustafson-Capková, Sofia and Britt Hartmann 2006. Manual of the Stockholm Umeå Corpus version 2.0. Dept. of Linguistics, Stockholm University.
Hagen, Kristin, Janne Bondi Johannessen and Anders Nøklestad 2000.
A constraint-based tagger for Norwegian. Carl-Erik Lindberg and Steffen Nordahl Lund (eds), Proceedings of the 17th Scandinavian Conference of Linguistics.
Hale, John and Eugene Charniak 1998. Getting useful gender statistics from English text. Technical Report, Comp. Sci. Dept. at Brown University, Providence, Rhode Island.
Hall, Johan 2003. A probabilistic part-of-speech tagger with suffix probabilities. Master’s thesis, Växjö University, Sweden.
Haspelmath, Martin 2006. Against markedness (and what to replace it with). Journal of Linguistics 42 (1): 25–70.
von Heusinger, Klaus 2002. Specificity and definiteness in sentence and discourse structure. Journal of Semantics 19 (3): 245–274.
Hobbs, Jerry R. 1976. Pronoun resolution. Technical Report, City College of New York.
Holmberg, Anders 1986. Word order and syntactic features in the Scandinavian languages and English. Ph.D. diss., Dept. of General linguistics, University of Stockholm.
Holmberg, Anders and Christer Platzack 1995. The role of inflection in Scandinavian syntax. New York/Oxford: Oxford University Press.
Holmes, Philip and Ian Hinchliffe 2003. Swedish: A comprehensive grammar. London: Routledge.
de Hoop, Helen and Monique Lamers 2006. Incremental distinguishability of subject and object. Leonid Kulikov, Andrej Malchukov and Peter de Swart (eds), Case, valency and transitivity. Amsterdam: John Benjamins.
Hopper, Paul J. and Sandra A. Thompson 1980. Transitivity in grammar and discourse. Language 56 (2): 251–299.
Hoste, Véronique 2005. Optimization issues in machine learning of coreference resolution. Ph.D. diss., University of Antwerp.
Hudson, Richard 1990. English word grammar. Oxford: Blackwell Publishers.
Hundt, Marianne 2004. Animacy, agentivity and the spread of the progressive in Modern English. English Language and Linguistics 8 (1): 47–69.
Jäger, Gerhard 2004. Learning constraint sub-hierarchies: The bidirectional Gradual Learning Algorithm.
Reinhard Blutner and Henk Zeevat (eds), Optimality theory and pragmatics, 251–287. Houndmills, Basingstoke, Hampshire: Palgrave/Macmillan.
Jäger, Gerhard and Anette Rosenbach 2006. The winner takes it all – almost: Cumulativity in grammatical variation. Linguistics 44 (5): 937–971.
Joanis, Eric and Suzanne Stevenson 2003. A general feature space for automatic verb classification. Proceedings of the 10th Conference of the European Association for Computational Linguistics (EACL), 163–70.
Johannessen, Janne Bondi 1998. Tagging and the case of pronouns. Computers and the Humanities 32 (1): 1–38.
Johansson, Richard and Pierre Nugues 2007. Extended constituent-to-dependency conversion for English. Joakim Nivre, Heiki-Jaan Kaalep and Mare Koit (eds), Proceedings of NODALIDA 2007, 105–112.
Johnson, Mark 1998. PCFG models of linguistic tree representations. Computational Linguistics 24 (4): 613–632.
Jurafsky, Dan 2003. Probabilistic modeling in psycholinguistics: Linguistic comprehension and production. Rens Bod, Jennifer Hay and Stefanie Jannedy (eds), Probabilistic linguistics, 289–341. Cambridge, MA: MIT Press.
Kager, René 1999. Optimality Theory. Cambridge: Cambridge University Press.
Kaplan, Ronald M. and Joan Bresnan 1982. Lexical-Functional Grammar: A formal system for grammatical representation. Joan Bresnan (ed.), The mental representation of grammatical relations, 173–281. Cambridge, MA: MIT Press.
Karlsson, Fred, Atro Voutilainen, Juha Heikkilä and Arto Anttila (eds) 1995. Constraint Grammar: A language-independent system for parsing unrestricted text. Berlin: Mouton de Gruyter.
Karypis, George 2002. Cluto: A clustering toolkit. Technical Report, Dept. of Computer Science, Univ. of Minnesota. Technical Report #02-017.
Katz, Jerrold J. and Jerry A. Fodor 1963. The structure of semantic theory. Language 39 (2): 170–210.
Keenan, Edward L. 1976. Towards a universal definition of “subject”. Charles N. Li (ed.), Subject and topic, 303–333.
Cambridge, MA: Academic Press.
Keenan, Edward L. and Bernard Comrie 1977. Noun phrase accessibility and universal grammar. Linguistic Inquiry 8 (1): 63–99.
Keller, Frank 2000. Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. Ph.D. diss., University of Edinburgh.
Klein, Dan and Christopher D. Manning 2003. Accurate unlexicalized parsing. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), 423–430.
Kokkinakis, Dimitrios 2001. A framework for the acquisition of lexical knowledge: Description and application. Ph.D. diss., Department of Swedish Language, Göteborg University.
Kokkinakis, Dimitrios 2004. Reducing the effect of name explosion. Proceedings of the LREC Workshop: Beyond Named Entity Recognition, Semantic labelling for NLP tasks.
Kübler, Sandra 2004. Memory-based parsing. Amsterdam: John Benjamins.
Kübler, Sandra and Jelena Prokić 2006. Why is German dependency parsing more reliable than constituent parsing? Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (TLT), 7–18.
Kuhn, Jonas 2001. Generation and parsing in Optimality Theoretic syntax: Issues in the formalization of OT-LFG. Peter Sells (ed.), Formal and empirical issues in Optimality-theoretic syntax, 313–366. Stanford, CA: CSLI Publications.
Kuno, S. and E. Kaburaki 1977. Empathy and syntax. Linguistic Inquiry 8: 627–672.
Lakoff, George 1987. Women, fire and dangerous things: What categories reveal about the mind. Chicago, IL: University of Chicago Press.
Lakoff, George and Mark Johnson 1980. Metaphors we live by. Chicago, IL: University of Chicago Press.
Lappin, Shalom and Stuart Shieber 2007. Machine learning theory and practice as a source of insight into universal grammar. Journal of Linguistics 43 (2): 393–427.
Levin, Beth 1993. English verb classes and alternations. Chicago, IL: University of Chicago Press.
Lin, Dekang 1998.
Automatic retrieval and clustering of similar words. Proceedings of the 17th International Conference on Computational Linguistics (COLING), Volume 2, 768–774.
Lyons, Christopher 1999. Definiteness. Cambridge: Cambridge University Press.
Lyons, John 1977. Semantics. Cambridge: Cambridge University Press.
MacDonald, Maryellen C., Neal J. Pearlmutter and Mark S. Seidenberg 1994. Lexical nature of syntactic ambiguity resolution. Psychological Review 101 (4): 676–703.
Magerman, David M. 1995. Statistical decision-tree models for parsing. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), 276–283.
Mak, Willem M., Wietske Vonk and Herbert Schriefers 2006. Animacy in processing relative clauses: The hikers that rocks crush. Journal of Memory and Language 54 (4): 466–490.
Manning, Christopher D. 2003. Probabilistic syntax. Rens Bod, Jennifer Hay and Stefanie Jannedy (eds), Probabilistic linguistics, 289–341. Cambridge, MA: MIT Press.
Manning, Christopher D. and Hinrich Schütze 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Marcus, M. P., B. Santorini and M. A. Marcinkiewicz 1993. Building a large annotated corpus for English: The Penn treebank. Computational Linguistics 19 (2): 313–330.
Markert, Katja and Malvina Nissim 2006. Metonymic proper names: A corpus-based account. Anatol Stefanowitsch and Stefan Th. Gries (eds), Corpus-based approaches to metaphor and metonymy, 152–174. Berlin: Mouton de Gruyter.
Maruyama, Hiroshi 1990. Structural disambiguation with constraint propagation. Proceedings of the 28th meeting of the Association for Computational Linguistics (ACL), 31–38.
McDonald, Ryan, Koby Crammer and Fernando Pereira 2005. Online large-margin training of dependency parsers. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 91–98.
McDonald, Ryan and Joakim Nivre 2007.
Characterizing the errors of data-driven dependency parsing. Proceedings of the Eleventh Conference on Computational Natural Language Learning (CoNLL), 122–131.
McDonald, Ryan, Fernando Pereira, Kiril Ribarov and Jan Hajič 2005. Non-projective dependency parsing using spanning tree algorithms. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 525–530.
McEnery, Tony and Andrew Wilson 1996. Corpus linguistics. Edinburgh: Edinburgh University Press.
Megyesi, Beáta 2002. Shallow parsing with PoS taggers and linguistic features. Journal of Machine Learning Research 2: 639–668.
Mel’čuk, Igor 1988. Dependency syntax: Theory and practice. Albany: State University of New York Press.
Merlo, Paola and Gabriele Musillo 2005. Accurate function parsing. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 620–627.
Merlo, Paola and Suzanne Stevenson 2001. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics 27 (3): 373–408.
Merlo, Paola and Suzanne Stevenson 2004. Structure and frequency in verb classification. Proceedings of Incontro di Grammatica Generativa XXX.
Mihalcea, Rada 2002. Instance based learning with automatic feature selection applied to word sense disambiguation. Proceedings of the 19th International Conference on Computational Linguistics (COLING).
Mihalcea, Rada 2006. Word sense disambiguation. Lecture notes from the ESSLLI 2006 Course on Word Sense Disambiguation, Málaga, Spain.
Mikkelsen, Line Hove 2002. Reanalyzing the definiteness effect: Evidence from Danish. Working Papers in Scandinavian Syntax 69: 1–75.
Mitchell, Tom M. 1997. Machine learning. New York: McGraw-Hill.
Morante, R. and B. Busser 2007. ILK2: Semantic role labeling of Catalan and Spanish using TiMBL. Proceedings of the 4th International Workshop on Semantic Evaluations (SEMEVAL), 183–186.
Nilsson, Jens and Johan Hall 2005.
Reconstruction of the Swedish treebank Talbanken. MSI report 05067, School of Mathematics and Systems Engineering, Växjö University.
Nilsson, Jens, Johan Hall and Joakim Nivre 2005. MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity. Proceedings of the NODALIDA Special Session on Treebanks, 119–132.
Nivre, Joakim 2003. An efficient algorithm for projective dependency parsing. Proceedings of the Eighth International Workshop on Parsing Technologies, 149–160.
Nivre, Joakim 2004. Incrementality in deterministic dependency parsing. Incremental parsing: Bringing engineering and cognition together. Workshop at ACL-2004, 50–57.
Nivre, Joakim 2006. Inductive dependency parsing. Dordrecht: Springer.
Nivre, Joakim, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel and Deniz Yuret 2007. CoNLL 2007 Shared Task on Dependency Parsing. Proceedings of the CoNLL Shared Task session of EMNLP-CoNLL 2007, 915–932.
Nivre, Joakim, Johan Hall and Jens Nilsson 2004. Memory-based dependency parsing. Proceedings of the Eleventh Conference on Computational Natural Language Learning (CoNLL), 49–56.
Nivre, Joakim, Jens Nilsson and Johan Hall 2006. Talbanken05: A Swedish treebank with phrase structure and dependency annotation. Proceedings of the fifth International Conference on Language Resources and Evaluation (LREC), 1392–1395.
Nivre, Joakim, Jens Nilsson, Johan Hall, Gülşen Eryiğit and Svetoslav Marinov 2006. Labeled pseudo-projective dependency parsing with Support Vector Machines. Proceedings of the Conference on Computational Natural Language Learning (CoNLL).
van Noord, Gertjan 2004. Error mining for wide-coverage grammar engineering. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), 446–453.
Nunberg, Geoffrey 1979. The non-uniqueness of semantic solutions: Polysemy. Linguistics and Philosophy 3: 143–184.
Orăsan, Constantin and Richard Evans 2001.
Learning to identify animate references. Proceedings of the Workshop on Computational Natural Language Learning, 1–8.
Orăsan, Constantin and Richard Evans 2007. NP animacy resolution for anaphora resolution. Journal of Artificial Intelligence Research 29: 79–103.
Øvrelid, Lilja 2004. Disambiguation of syntactic functions in Norwegian: Modeling variation in word order interpretations conditioned by animacy and definiteness. Fred Karlsson (ed.), Proceedings of the 20th Scandinavian Conference of Linguistics. http://www.ling.helsinki.fi/kielitiede/20scl/proceedings.shtml.
Øvrelid, Lilja 2005. Animacy classification based on morphosyntactic corpus frequencies: Some experiments with Norwegian nouns. Kiril Simov, Dimitar Kazakov and Petya Osenova (eds), Proceedings of the Workshop on Exploring Syntactically Annotated Corpora, 24–34.
Øvrelid, Lilja 2006. Towards robust animacy classification using morphosyntactic distributional features. Proceedings of the EACL 2006 Student Research Workshop, 47–54.
Øvrelid, Lilja and Joakim Nivre 2007. When word order and part-of-speech tags are not enough – Swedish dependency parsing with rich linguistic features. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), 447–451.
Platzack, Christer 1987. Huvudsatsordföljd och bisatsordföljd. Ulf Teleman (ed.), Grammatik på villovägar, 87–96. Solna: Esselte Studium.
Pollard, Carl and Ivan A. Sag 1994. Head-driven Phrase Structure Grammar. Chicago, IL: University of Chicago Press.
Prat-Sala, Mercè and Holly P. Branigan 2000. Discourse constraints on syntactic processing in language production: A cross-linguistic study in English and Spanish. Journal of Memory and Language 42 (2): 168–182.
Pustejovsky, James 1991. The generative lexicon. Computational Linguistics 17 (4): 409–441.
Quinlan, J. Ross 1993. C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.
Rahkonen, Matti 2006.
Some aspects of topicalization in active Swedish declaratives: A quantitative corpus study. Linguistics 44 (1): 23–55.
Resnik, Philip 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition 61 (1): 127–159.
Rosenbach, Anette 2002. Genitive variation in English: Conceptual factors in synchronic and diachronic studies. Berlin/New York: Mouton de Gruyter.
Rosenbach, Anette 2003. Aspects of iconicity and economy in the choice between the s-genitive and the of-genitive in English. Günter Rohdenburg and Britta Mondorf (eds), Determinants of grammatical variation in English, 379–411. Berlin/New York: Mouton de Gruyter.
Rosenbach, Anette 2005. Animacy versus weight as determinants of grammatical variation in English. Language 81 (3): 613–644.
Rosenbach, Anette 2008. Animacy and grammatical variation – findings from English genitive variation. Lingua 118 (2): 151–171.
Sag, Ivan A. and Thomas Wasow 2008. Performance-compatible competence grammar. Robert D. Borsley and Kersti Börjars (eds), Non-transformational theories of syntax. Oxford: Blackwell Publishers.
Sag, Ivan A., Thomas Wasow and Emily M. Bender 2003. Syntactic theory: A formal introduction. 2nd edition. Stanford, CA: CSLI Publications.
Sahlgren, Magnus 2006. The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. diss., Stockholm University.
Schröder, Ingo 2002. Natural language parsing with graded constraints. Ph.D. diss., Dept. of Computer Science, University of Hamburg.
Schröder, Ingo, Horia F. Pop, Wolfgang Menzel and Kilian A. Foth 2001. Learning grammar weights using genetic algorithms. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP).
Schütze, Hinrich 1998. Automatic word sense discrimination. Computational Linguistics 24 (1): 97–122.
Seidenberg, Mark S. and Maryellen C.
MacDonald 1999. A probabilistic constraints approach to language acquisition and processing. Cognitive Science 23 (4): 569–588.
Sgall, Peter, Eva Hajičová and Jarmila Panevová 1986. The meaning of the sentence in its pragmatic aspects. Dordrecht: Reidel.
Siewierska, Anna 1988. Word order rules. London: Croom Helm.
Silverstein, Michael 1976. Hierarchy of features and ergativity. Robert M.W. Dixon (ed.), Grammatical categories in Australian languages, 112–171. Canberra: Australian Institute of Aboriginal Studies.
Stevenson, Suzanne and Eric Joanis 2003. Semi-supervised verb class discovery using noisy features. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 71–78.
Stevenson, Suzanne and Paul Smolensky 2005. Optimality in sentence processing. Paul Smolensky and Geraldine Legendre (eds), The harmonic mind, 827–860. Cambridge, MA: MIT Press.
Sveen, Andreas 1996. Norwegian impersonal constructions and the unaccusative hypothesis. Ph.D. diss., University of Oslo.
de Swart, Peter 2007. Cross-linguistic variation in object marking. Ph.D. diss., Netherlands Graduate School of Linguistics.
de Swart, Peter, Monique Lamers and Sander Lestrade 2008. Animacy, argument structure and argument encoding: Introduction to the special issue on animacy. Lingua 118 (2): 131–140.
Teleman, Ulf 1974. Manual för grammatisk beskrivning av talad och skriven svenska. Lund: Studentlitteratur.
Teleman, Ulf, Staffan Hellberg and Erik Andersson 1999. Svenska Akademiens grammatik. Stockholm: Norstedts.
Tjong Kim Sang, Erik 2002a. Memory-based named entity recognition. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 203–206.
Tjong Kim Sang, Erik F. 2002b. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 155–158.
Tjong Kim Sang, Erik F. and Sabine Buchholz 2000.
Introduction to the CoNLL-2000 Shared Task: Chunking. Proceedings of the Conference on Computational Natural Language Learning (CoNLL).
Schulte im Walde, Sabine 2006. Experiments on the automatic induction of German semantic verb classes. Computational Linguistics 32 (2): 159–194.
Schulte im Walde, Sabine 2007. The induction of verb frames and verb classes from corpora. Anke Lüdeling and Merja Kytö (eds), Corpus linguistics. An international handbook. Berlin: Mouton de Gruyter.
Wasow, Thomas, Amy Perfors and David Beaver 2005. The puzzle of ambiguity. Orhan Orgun and Peter Sells (eds), Morphology and the web of grammar: Essays in memory of Steven G. Lapointe, 265–282. Stanford, CA: CSLI Publications.
Weber, Andrea and Karin Müller 2004. Word order variation in German main clauses: A corpus analysis. Proceedings of the 20th International Conference on Computational Linguistics, 71–77.
Weckerly, J. and M. Kutas 1999. An electrophysiological analysis of animacy effects in the processing of object relative sentences. Psychophysiology 36: 559–570.
Yamada, Hiroyasu and Yuji Matsumoto 2003. Statistical dependency analysis with support vector machines. Gertjan Van Noord (ed.), Proceedings of the Eighth International Workshop on Parsing Technologies (IWPT), 195–206.
Yamamoto, Mutsumi 1999. Animacy and reference: A cognitive approach to corpus linguistics. Amsterdam: John Benjamins.
Zaenen, Annie, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-Garboden, Tatiana Nikitina, M. Catherine O’Connor and Tom Wasow 2004. Animacy encoding in English: why and how. Donna Byron and Bonnie Webber (eds), Proceedings of the ACL Workshop on Discourse Annotation.
Zeevat, Henk and Gerhard Jäger 2002. A reinterpretation of syntactic alignment. D. de Jongh, H. Zeevat and M. Nilsenova (eds), Proceedings of the 3rd and 4th International Symposium on Language, Logic and Computation. Amsterdam.
Zhao, Yang and George Karypis 2003.
Criterion functions for document clustering. Technical Report, Dept. of Computer Science, Univ. of Minnesota.