LIVE and LEARN
Festschrift in honor of Lars Borin

INSTITUTIONEN FÖR SVENSKA, FLERSPRÅKIGHET OCH SPRÅKTEKNOLOGI
GU-ISS-2022-03
Forskningsrapporter från Institutionen för svenska, flerspråkighet och språkteknologi, Göteborgs universitet
Research Reports from the Department of Swedish, Multilingualism, Language Technology, University of Gothenburg
ISSN 1401-5919

Editors: Elena Volodina, Dana Dannélls, Aleksandrs Berdicevskis, Markus Forsberg, Shafqat Virk

© 2022 The authors (individual papers), the editors (collection) and the Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Box 200, SE-405 30 Gothenburg
ISBN 978-91-87850-82-0 (print)
ISBN 978-91-87850-83-7 (digital)
Cover: Design by Karin Wenzelberg based on a word cloud illustration by Elena Volodina
Compilation: Elena Volodina, Dana Dannélls and Gerlof Bouma
Photo of Lars Borin by Johan Wingborg
Printed by BrandFactory AB, Kållered, 2022

This collection and all papers in it are licenced under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/

Preface

This volume is dedicated to Lars Borin, who for many years has been our dear colleague and an inspiring scientific leader. Lars Borin, born on February 2, 1957, is a professor of Natural Language Processing at the University of Gothenburg and a co-director of the R&D unit Språkbanken Text at the same university. He is also the director of Nationella språkbanken, a Swedish nationwide e-infrastructure for language technology, and the director of the Swedish node of CLARIN ERIC, an EU infrastructure for language technology.
Lars’ research activities are directed at the development of language resources and tools for all contemporary and historical varieties of written Swedish. Methodological and theoretical links between general linguistics and NLP are one of his central research interests. Such links are crucial both for linguistic research and for the incorporation of the knowledge gained through this research into increasingly sophisticated language processing systems. He specializes in many fields, the most prominent being language technology infrastructure, digital language resources, digital historical linguistics, computational lexicography, lexical semantics, language typology, digital humanities, computer-assisted language learning, multi-word expressions, and text corpora.

Lars is well known for his work at Språkbanken Text, where he has been, and still is, a mastermind and driving force behind making large collections of corpora and computer-readable lexicons of modern and historical Swedish. He has always been concerned about making all of these resources openly available to users, both for download and for searches through web search interfaces. Over the course of his career, he has strengthened the status of Swedish language technology in Sweden and in the world. Thanks to his efforts, Språkbanken’s resources and tools have become the source for developing state-of-the-art methods in various research fields such as Swedish linguistics, computational linguistics, historical linguistics, digital humanities and social science, history, language acquisition, and many others.

It is with great pleasure that we present Lars with this Festschrift to honor his lasting contributions, nationally and internationally, and to highlight his importance to his friends and colleagues at the University of Gothenburg and elsewhere in the world.
The Tabula Gratulatoria included in this volume lists the names of many friends and colleagues, albeit certainly not all, who want to join in paying tribute to Lars on his 65th birthday. The contributions to the Festschrift reflect only a fraction of Lars’ scientific interests. They come from his friends and colleagues around the world and deal with topics that have been – in one way or another – inspired by his work. A common theme for the articles is the never-ending need to learn, which is alluded to in the title of the volume, Live and Learn.

Gothenburg, November 2022
The editors

Tabula Gratulatoria

Yvonne Adesam Normunds Grūzītis Magnus Ahltorp Marianne Gullberg Lars Ahrenberg Roger Gyllin Karin Aijmer Martin Hammarstedt Cai Alfredson Harald Hammarström David Alfter Karin Helgesson Christiane Andersen Simon Hengchen Karin Andersson Louise Holmer Aleksandrs Berdicevskis David House Ulf Bjereld Christine Howes Lars Björk Lars Ilshammar Kristian Blensenius Jonas Ingvarsson Gerlof Bouma Sofie Johansson Johan Boye Richard Johansson Daniel Brodén Arne Jönsson Lars Burman Mats Jönsson Love Börjeson Jelena Kallas Nicoletta Calzolari Jussi Karlgren Dick Claésson Susanna Karlsson Bernard Comrie Jenny Kierkemann Robin Cooper Malin Klang (formerly Ahlberg) Evie Coussé Per Klang (formerly Malm) Mats Dahllöf Dimitrios Kokkinakis Dana Dannélls Marco Kuhlmann Koenraad De Smedt Murathan Kurfalı Marie Demker Hans Landqvist Simon Dobnik Shalom Lappin Rickard Domeij Staffan Larsson Jens Edlund Ann Lillieström Adam Ek Anna Lindahl Elisabet Engdahl Cecilia Lindhé Stina Ericsson Therese Lindström Tiedemann Gunnar Eriksson Krister Lindén Ghazaleh Esfandiari Baiat Peter Ljunglöf Markus Forsberg Sharid Loáiciga Karin Friberg Heppin Benjamin Lyngfelt Johan Frid Lennart Lönngren Mats Fridlund Mats Malm Antoaneta Granberg Arianna Masciolini Johannes Graën Arild Matsson Eleni Gregoromichelaki Beáta Megyesi Magnus Merkel Natalia Sathler Sigiliano Detmar Meurers Baiba Saulīte Tommaso M.
Milani Anju Saxena Yousuf Ali Mohammed Anne Schumacher Felix Morger Maria Skeppstedt Ricardo Muñoz Sánchez Emma Sköldberg Gunta Nešpore-Bērzkalne Sara Stymne Jenny Nilsson Anna Sågvall Hein Kristina Nilsson Björkenstam Nina Tahmasebi Sanni Nimb Jennica Thylin-Klaus Joakim Nivre Jörg Tiedemann Catrin Norrby Tiago Timponi Torrent Joel Olofsson Maria Toporowska Gronostaj Sussi Olsen Jonatan Uppström Anders Olsson Shafqat Virk Leif-Jöran Olsson Martin Volk Bolette Pedersen Elena Volodina Stellan Petersson Michelle Waldispühl Miriam R. L. Petruck Barbro Wallgren Hemlin Eva Pettersson Åsa Wengelin Ildikó Pilán Lena Wenner Julia Prentice Søren Wichmann Taraka Rama Mats Wirén Aarne Ranta Victor Wåhlstrand Skärström Judy Ribeck Nyström Niklas Zechner Lena Rogström Torsten Zesch Jacobo Rouces González Alexander Ziem Johan Roxendal Maria Öhrman Stian Rødven-Eide Robert Östling Magnus Sahlgren Lilja Øvrelid

Contributed papers

Att vara Lars: Några tankar om språkteknologi och socioonomastik. Lars Ahrenberg 1
We may actually all die tomorrow...
nevertheless: Predicting short-term frequency changes in Swedish neologisms. Aleksandrs Berdicevskis, Yvonne Adesam and Evie Coussé 5
avokado-r/-er/-s/-sar. Kristian Blensenius and Louise Holmer 13
Counting dirty words: The effect of OCR quality on token statistics in historical Swedish corpora. Gerlof Bouma and Yvonne Adesam 17
Investigating a linguistic mini landscape: The Tsez (Dido) dialect dictionary project. Bernard Comrie 25
Beyond strings of characters: Resources meet NLP – Again. Dana Dannélls, Tiago Timponi Torrent, Natalia Sathler Sigiliano and Simon Dobnik 29
Ordvektorer i lexikografiskt arbete. Markus Forsberg and Emma Sköldberg 37
The diachrony of political terror: Tracing terror and terrorism in Swedish parliamentary data 1867–1970. Mats Fridlund, Daniel Brodén and Victor Wåhlstrand Skärström 43
UD-based Latvian FrameNet. Normunds Grūzītis, Gunta Nešpore-Bērzkalne and Baiba Saulīte 49
The rise and fall of grammatical theories in descriptive grammars of the languages of the world. Harald Hammarström 55
Coveting your neighbor’s wife: Using lexical neighborhoods in substitution-based word sense disambiguation. Richard Johansson 61
Linguistics concepts as semantic frames. Per Klang and Shafqat Mumtaz Virk 67
Deep learning models as theories of linguistic knowledge. Shalom Lappin 73
The case for Språkbanken Dialog. Staffan Larsson, Christine Howes and Eleni Gregoromichelaki 79
Ordbildning på icke-verbal grund. Om integrering av ikoner och ljudeffekter i grammatiken. Benjamin Lyngfelt and Joel Olofsson 83
Building a multilingual AWE tool for L2 learners: Challenges and ideas. Arianna Masciolini 89
Flexible Universal Dependencies using nested dependency graphs. Joakim Nivre 95
Perspective_on: Semantic relations for frames and constructions. Miriam R.L.
Petruck and Alexander Ziem 101
Natural language processing for educational applications: Recent advances. Ildikó Pilán 107
Detecting fake papers with the latent algorithm for recursive search. Stian Rødven-Eide and Ricardo Muñoz Sánchez 111
Leksikalsk-semantiske sprogressourcer: Hvad kan de, og hvordan udvikler vi dem bedst? Bolette Sandford Pedersen, Sanni Nimb and Sussi Olsen 115
Den ena texten och den andra: Visualisering av textpar med hjälp av ordmoln. Maria Skeppstedt, Gunnar Eriksson, Magnus Ahltorp and Rickard Domeij 121
Datorlingvistikens vagga i Uppsala. Anna Sågvall Hein 127
From open parallel corpora to public translation tools: The success story of OPUS. Jörg Tiedemann 133
Binomials in Swedish corpora – ‘Ordpar 1965’ revisited. Martin Volk and Johannes Graën 139
ICALL: Research versus reality check. Elena Volodina and David Alfter 145
Lyxig språklig födelsedagspresent from the Swedish Word Family. Elena Volodina, Yousuf Ali Mohammad and Therese Lindström Tiedemann 153
Annotating the narrative: A plot of scenes, events, characters and other intriguing elements. Mats Wirén, Adam Ek and Murathan Kurfalı 161
The other SAT-Solver: Applying lexicons to SweSAT word questions. Niklas Zechner 167
Mot en mänskligare maskinöversättning. Robert Östling 171

Att vara Lars: Några tankar om språkteknologi och socioonomastik

Lars Ahrenberg
Institutionen för datavetenskap, Linköpings universitet, Sverige
lars.ahrenberg@liu.se

Abstract

Since the SweClarin project began in 2015, its resources in terms of data and tools have been used in many different projects, including in linguistics. A research area where they have been less employed is the study of names. In this paper I suggest that language technology and general corpora can contribute to the sociological study of personal names, and I offer a few examples. As is fit for the occasion, I take Lars as the point of departure.

1 Name research and language technology

Name research (onomastics) is a field of study with a long history.
While it traditionally focused on place names and etymologies, the field has broadened and today encompasses many different kinds of names, investigated with a variety of methods. A lively subfield is socio-onomastics, the sociolinguistic study of names.1 Even though the number of researchers is small, a number of initiatives to develop the field in the Nordic countries have been taken in recent years. Alongside the cooperation committee NORNA, which has existed since 1971, there is now a research network, New Trends in Nordic Socio-onomastics, with an active website, and a journal, Nordisk tidskrift för socioonomastik, which published its first issue in 2021. According to its website, the journal publishes scholarly articles on the role of proper names in society and in social interaction. It is interdisciplinary and welcomes contributions from different disciplines, meaning that authors may draw on a breadth of theories, methods and perspectives to analyze names, and combine different types of data.2

I have searched the forums mentioned above for articles and blog posts that use large-scale corpus analysis or language technology in some form, but found none. Methodologically, apart from registers, texts of various kinds are used: questionnaires, interviews, recorded conversations and historical sources. Large corpora of the kind that Språkbanken Text has produced over the years, and the use of language technology tools, are conspicuously absent, even though an infrastructure project such as Swe-Clarin has now been running since 2015. I therefore find it fitting to speculate on how language technology could contribute to name research. I restrict myself here to first names, specifically taking my own (and today’s jubilarian’s) as the point of departure, that is, Lars, and ask how materials such as the so-called Gigaword corpus (Eide et al., 2016) can be used for this purpose. An underlying question is whether those of us who, in mid-twentieth-century Sweden, were given Lars as our first name have benefited from it.
Differing visibility in, for example, news media could be one indication that the name makes a difference. Personal names, and first names in particular, carry social meanings. Most first names in Sweden lead us to draw more or less certain conclusions about gender, age, ethnicity, family background, cultural and religious affiliation, and more. Compare, for example, Lars with Lauri, Laurent or Laura. In her doctoral thesis, Emilia Aldrin used questionnaires and interviews to study parents’ attitudes to name choices for their newborns, and argues that these can be characterized in terms of social positioning (Aldrin, 2011, 67f). Lars is not part of those discussions, but based on her categories I would classify it as Swedish-oriented rather than international, traditional rather than modern, and common rather than original.

1https://www.nordicsocioonomastics.org/about-socio-onomastics/
2https://gustavadolfsakademien.se/tidskrifter/tidskrift/nordisk-tidskrift-for-socioonomastik-nordic-journal-of-socio-onomastics

Lars Ahrenberg. 2022. Att vara Lars: Några tankar om språkteknologi och socioonomastik. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 1–4. Available under CC BY 4.0

Decade   1920-29  1930-39  1940-49  1950-59  1960-69
Rank         7        3        1        1        5
Count     3799    12213    26570    24130    18810

Table 1: Lars as a first name (tilltalsnamn) in the period 1920–1970. Source: SCB’s name statistics, table tilltalsnamn-man-per-decennium-1920-2020-topp-10.

2 What can language technology infrastructure contribute?

Social meanings should be possible to study also in large corpora such as the Gigaword corpus. That such corpora are not socially neutral is well known, and this is often perceived as negative, as a ‘bias’ that must be remedied, for instance so that the models generated from them can be used. But for studying social meanings it is, on the contrary, necessary that the corpus gives as representative a picture as possible of the time or the context it covers.
To this corpus one can then apply the methods that language technology has developed for modelling data, here specifically name usage. Name usage can be compared via co-occurrences with concepts, or via sentiment analysis. Word embeddings should also be usable, in the same way as, for example, Garg et al. (2018) analyzed ethnic stereotypes. One may object that text corpora are not about names but about their referents; in the case of personal names, about the persons named. Can we go from individual descriptions of these persons, what they do and what happens to them, to statements about the names used? I would argue that we can, provided that the corpus is representative of its time and large enough to include sufficiently many persons with the same name. In that case we can regard these persons as a social cohort defined by their name. Whatever differences we can extract in models and linguistic associations can then be tied to the name rather than to the individual persons. Such differences are, admittedly, fundamentally statistical insofar as they can be established at all; but they can still be interesting, and the question of what causes them is primarily one for the sociology of names to answer.

3 Lars

Lars has been used as a first name in Sweden at least since early Christian times. Both high and low have borne the name: the episcopal roll of the Diocese of Linköping mentions a bishop Lars in the 13th century, but poor folk have been called Lars too, as in Fröding’s poem Lars i Kuja (... ty allt som växer åt Lars är sten och sten är dålig förtäring ‘for all that grows for Lars is stone, and stone is poor fare’). The name had a long period of great popularity as a first name during the 20th century, above all in the 1940s and 1950s, see Table 1. According to SCB, it was still the most common first name among Swedish men in 2021.3 After 1970 its popularity has declined, and in the 2000s only some twenty or fewer newborn boys per year have been given it as their first name.
4 Data

The data for the analyses are mainly taken from the Gigaword corpus, compiled at and downloadable from Språkbanken Text (Eide et al., 2016). I have restricted myself to the news texts and three decades, the 1990s, 2000s and 2010s, where the last decade ends in 2013 for the news texts. The news texts were chosen because they best mirror public life in Sweden. For some analyses, the material from the period 2000-2009 was split into three subcorpora and that from 2010-2013 into two. The numbers of sentences and tokens are given in Table 2.

3https://www.scb.se/hitta-statistik/sverige-i-siffror/namnsok/

Subcorpus       Sentences       Tokens
news1990        6,321,173   95,435,081
news2000-01     5,700,000   99,093,472
news2000-02     5,700,000   99,240,929
news2000-03     6,012,341   90,238,180
news2010-01     5,200,000   86,840,551
news2010-02     5,592,318   82,157,754
All            34,525,832  553,005,967

Table 2: Numbers of sentences and tokens in the news section of the Gigaword corpus.

5 Analysis examples

The fact that Lars is such a common name should mean that it is common in the corpus as well. And so it is, but it is not the most common. Lars is the most common male name in the 1990s part, but over the corpus as a whole, Anders and Peter are more common, see Table 3. Producing frequency data for all names in the corpus requires disambiguation; we only want the occurrences that actually refer to a person. Doing this exactly is not entirely simple, since many of the most common names (Lars, Göran, Helena, ...) occur in names of other things, such as churches, city districts and hospitals, while other names, such as Stig, Bo, Sten, ..., can be something other than proper names. Based on spot checks, I estimated that about 44% of all occurrences of Hans are possessive pronouns. One should also keep in mind that many names are popular outside Sweden, such as Peter, which ranks high on the list.
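The person-reference filtering just described can be sketched in a few lines of Python. This is a toy illustration, not the pipeline used for the paper: the (token, tag) pairs and the SUC-style tag "PM" for proper nouns are illustrative assumptions.

```python
# Count first-name occurrences, keeping only tokens tagged as proper
# nouns ("PM"), so that e.g. 'hans' as a possessive pronoun ('his')
# is not counted as the name Hans.
from collections import Counter

def name_frequencies(tagged_sentences, names):
    counts = Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            if tag == "PM" and token in names:
                counts[token] += 1
    return counts

# Toy corpus of (token, PoS-tag) pairs:
corpus = [
    [("Lars", "PM"), ("säger", "VB"), ("att", "SN"), ("hans", "PS"),
     ("bok", "NN"), ("är", "VB"), ("klar", "JJ")],
    [("Hans", "PM"), ("träffade", "VB"), ("Lars", "PM")],
]
print(name_frequencies(corpus, {"Lars", "Hans"}))
# Counter({'Lars': 2, 'Hans': 1}) -- the possessive 'hans' is excluded
```

A tag-based filter of this kind would not by itself resolve cases like churches or city districts named after a person, which is why the spot-check estimates above are still needed.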
Subcorpus         Most common male names, in order
news1990-99       Lars, Anders, Peter, Jan, Göran
news2000-09       Anders, Peter, Lars, Johan, Fredrik
news2010-13       Anders, Johan, Peter, Fredrik, Lars
All (1990-2013)   Anders, Peter, Lars, Johan, Fredrik

Table 3: The five most common male names in different parts of the Gigaword corpus.

A question worth investigating is how the occurrence of names in news media relates to the occurrence of those names in the population. To investigate this, I have correlated name statistics from SCB’s tables with frequency data from the news section of the Gigaword corpus for different periods. Figure 1 shows how occurrence in the news section of the corpus relates to population statistics. That figure uses a table of first names among registered residents in each decade, but comparisons with first names given to newborns in earlier decades give similar results: Lars is consistently more common in the news texts than in the population statistics, and this is most accentuated in the 1990s. From such data one might venture the hypothesis that it was advantageous in Sweden to be named Lars in the middle of the 20th century. That a name occurs in the serious press in most cases means that the referent is successful in some way, be it in sports, politics, cultural life or something else. There are, however, many possible sources of error to consider: names in the corpus may refer to other entities, and to other persons than those born in Sweden in the assumed interval; SCB groups different spellings under one name; and so on.

We can investigate the hypothesis further by looking at co-occurrences between names and other words in the corpus. Table 4 shows which names are most often linked to the title professor in the subcorpus from the 1990s, and which names most often make statements (via the word säger ‘says’). As a contrast, we can compare which names most often co-occur with spelar ‘plays’, a verb that is more common in domains such as sports and culture.
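As a toy illustration of how such co-occurrence counts can be computed: for each first name, count the sentences in which it appears together with a cue word such as professor or säger. The sentences below are invented; the actual study used the Gigaword news subcorpora.

```python
# Count, per name, the number of sentences containing both the name
# and a given cue word.
from collections import Counter

def cooccurrences(sentences, names, cue):
    counts = Counter()
    for tokens in sentences:
        token_set = set(tokens)
        if cue in token_set:
            for name in names & token_set:
                counts[name] += 1
    return counts

sentences = [
    ["professor", "Lars", "håller", "föreläsning"],
    ["Lars", "säger", "att", "det", "går", "bra"],
    ["Anders", "spelar", "fotboll"],
    ["professor", "Anders", "och", "Lars"],
]
names = {"Lars", "Anders"}
print(cooccurrences(sentences, names, "professor"))
# Counter({'Lars': 2, 'Anders': 1})
```

A real implementation would of course also need the proper-noun filtering discussed in Section 5, and could normalize counts by each name’s overall frequency.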
With a parsed corpus, such investigations could be made more exhaustive. The difference between Lars and the other most common names, Anders and Peter, is perhaps the most interesting one: the latter play more often but make statements less often.

Figure 1: Relation between the occurrence of selected male names in the population and in different parts of the news section of the Gigaword corpus. The population data are taken from SCB’s statistical database: the 100 most common first names among registered residents on 31 December of each year, for the years 1980–2021.

Name      professor   säger/sade   spelar/spelade
Lars          140         113            46
Björn          50          43            23
Göran          42         113            15
Jan            39          99            22
Lennart        47          60            13
Bengt          44          49            27
Peter          31          73            61
Anders         25          76            71

Table 4: Co-occurrences between first names and selected words in news1990.

To see which names are most similar to Lars, we can use word embeddings. Here I have used Word2Vec (Mikolov et al., 2013) in the Gensim framework, for all six subcorpora.4 The 10 nearest neighbours of the Lars vector were noted in each of them. No name occurred in all six neighbourhoods, but recurrent ones were Bengt (5 times) and Jan, Christer, Lennart and Ulf (4 times each). It is noteworthy that 55 of the 60 embeddings represent male names, and that these are traditional Swedish names with either a biblical and/or a Nordic history. The absence of international names such as Peter and Thomas is striking here. The female names show a similar pattern: more than 90% of the names in the closest neighbourhood of a given female name are female names.

My conclusions are that large text corpora and language technology certainly should be able to play a role in the sociology of names, and that there are quite a few interesting methodological challenges along the way.

4https://radimrehurek.com/gensim/index.html

References

Emilia Aldrin. 2011. Namnval som social handling: val av förnamn och samtal om förnamn bland föräldrar i Göteborg 2007–2009. Ph.D. thesis, Institutionen för nordiska språk, Uppsala universitet. Namn och samhälle 24.

Stian Rødven Eide, Nina Tahmasebi, & Lars Borin. 2016.
The Swedish culturomics Gigaword corpus: A one billion word Swedish reference dataset for NLP. In Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, Proceedings of the Workshop, Krakow, number 126, pages 8–12. Linköping University Electronic Press.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, & James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Tomas Mikolov, Kai Chen, Greg Corrado, & Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

We may actually all die tomorrow... nevertheless: Predicting short-term frequency changes in Swedish neologisms

Aleksandrs Berdicevskis, Yvonne Adesam
Språkbanken Text, Dept of Swedish, Multilingualism, Language Technology, University of Gothenburg, Sweden
{aleksandrs.berdicevskis,yvonne.adesam}@gu.se

Evie Coussé
Dept of Languages and Literatures, University of Gothenburg, Sweden
evie.cousse@gu.se

Abstract

Predicting the future is difficult, as Lars Borin likes to point out by saying the phrase which is included in the title of this paper. Nevertheless, we attempt to predict short-term changes in the frequency of new Swedish words based on some measures of their linguistic and social dissemination. We show that it is possible to predict the direction of change with a higher-than-baseline accuracy. Most interestingly, we show that predictions are much less accurate for those words that denote new phenomena than for those that are new signifiers for already existing phenomena.

1 Introduction

When doing research on language change, linguists usually try to explain the changes, either explicitly or implicitly, either putting forward hypotheses about causal links or making silent assumptions about such links.
In our view, the most rigorous means to test such hypotheses and assumptions is to attempt to predict language change. Predicting the future is notoriously difficult (and annoying: as follows from the title, there is always a certain risk that the researchers will not be able to evaluate their own predictions). Fortunately, there is an easier way to evaluate the predictive power of a theory: splitting the data into a seen and an unseen set, training a theory-based model on the former and testing it on the latter. In this paper, we use this approach in order to explore whether short-term frequency changes of Swedish neologisms can be predicted from corpus data. Since neologisms are likely to experience such changes (either an increase, if they become established in the language, or a decrease if they fail to do so), they are a favourable ground for this kind of prediction. We focus on comparing how successful predictions are for words that denote new phenomena and those that are new signifiers for already existing phenomena. We hypothesize that the former would be more difficult to predict, since they depend more on language-external events than on linguistic and sociolinguistic processes, which we can hope to capture with our measures. Our results support this hypothesis.

2 Data

Språkrådet, the Swedish language council (in the last decade together with the magazine Språktidningen), releases a list of new words every year.1 The words on this neologism list are supposed to have come into use, or gained in use, in the last year. The final list is in no way a complete (if there could even be such a thing) list of new words, but rather a list of words that the compilers consider especially interesting or telling about the past year (Karlsson, 2021). Our work starts with these neologism lists for the past twenty years, 2003–2021. The main reasons for using them are that they are readily available and a well-known part of Swedish linguistic debate. While the lists are based on expert knowledge, one of the weaknesses for our purposes is their selection procedure. Words are picked in part based on their societal relevance, but mostly, which may not be as relevant for us, with the purpose of language cultivation, to show current trends and the breadth in patterns of word formation and language creativity (Karlsson, 2021).

The restriction in time depends on the corpora that we use. This particular study primarily reports on data from Flashback, while also consulting Familjeliv in the initial stages of the data extraction process. Both corpora represent very large Swedish discussion forums covering a broad range of topics. With more than eight billion tokens, we expect the corpora to be large enough to show new words from the onset, a point when they will still be very infrequent. We assume that the language in this material is closer to everyday use and less edited than many other types of text, which may mean that non-normative language is more frequent, or shows up earlier than in edited texts such as news.

1https://www.isof.se/stod-och-sprakrad/spraktjanster/nyordslistor.

Aleksandrs Berdicevskis, Yvonne Adesam and Evie Coussé. 2022. We may actually all die tomorrow... nevertheless: Predicting short-term frequency changes in Swedish neologisms. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 5–11. Available under CC BY 4.0
All corpora are provided through Språkbanken Text and its corpus infrastructure Korp (Borin et al., 2012).2 We first extracted all words from the neologism lists, removing multi-word units for ease of searching.3 We then gathered statistics about the frequencies of these words in the discussion forums Flashback and Familjeliv, by matching each word from the list to its lemma or, in case the lemma is not available, to the word form directly. Not all words in the corpora have a lemma annotated, e.g. because no match for the word can be found in the Saldo lexicon (Borin et al., 2013). It should also be noted that not all words in the neologism lists are in their base form. Many words in the full neologism list have few, if any, instances in the corpora, and will thus be difficult to track over time. We therefore selected the words with the highest frequencies from this list. For each word we added so-called lemgrams, lexicon identifiers from the Saldo lexicon, which give us all inflected forms. For words not in the lexicon, we manually added all inflected forms. In this process, we also removed a number of words which would give us too many erroneous matches in the corpora, for example VAR ‘video assistant referee’, which happens to coincide with the verb form var ‘was’ and the wh-word var ‘where’, or manga ‘Japanese comics’, which turned out to be mostly instances of a misspelled många ‘many’. For the same reason, we also excluded words which were not new themselves but acquired a new meaning, if this meaning was relatively infrequent compared to the previous meaning(s); for instance, spår ‘education track’ (general meaning: ‘track’). We kept the words where the situation was the reverse, i.e. where the frequency of the older meaning(s) was small; e.g. buda ‘to make a bid’ (older meaning: ‘to send with a courier’). Some words in the list were spelling or pronunciation variants, e.g.
babybio and bebisbio ‘adapted movie showings for parents with babies’, and these were joined into one item in our list. For other entries we added spelling variants, e.g. covid and covid19 to covid-19. We did not merge words that are derived from the same stem, such as blogg ‘a blog’, blogga ‘to blog’, bloggare ‘blogger’. In the end, we had a list of 75 words, which all had absolute frequencies of more than 2300 in Flashback and Familjeliv taken together, see Appendix A. We labelled each word in this list as either denoting a new phenomenon (e.g. covid-19) or being a new signifier for a previously existing phenomenon (e.g. buda ‘to make a bid’; prio ‘a priority’, a shortening of prioritet). A new phenomenon is one which is (relatively) new for most of Swedish society (e.g. anime ‘Japanese animation’ is not new as a phenomenon in Japan, nor, perhaps, among its early fans in Sweden, but it was largely unknown to the mainstream public in Sweden before 2003). Dealing with numerous borderline cases, we tried to establish whether a change occurs in the language (and discourse) or in the material world. This process was further complicated by the fact that in many cases either the signifier, or the phenomenon, or both, are not actually new, but experienced a substantial increase in frequency. In total, we end up with 42 “new phenomena” and 33 “new signifiers”.

2spraakbanken.gu.se/korp
3Multi-word expressions may be explored in future research.

Figure 1: A visualization of our method for the word dampa ‘freak out’. Large filled squares represent those months that are in the test set and for which the direction of change has to be predicted (from the previous month); small filled circles represent the months for which the direction of change has been correctly predicted by a randomly chosen model.
3 Methods

Stewart & Eisenstein (2018) show that changes in the frequency of new words on English-language Reddit can be predicted rather successfully using measures of linguistic dissemination and social dissemination, of which the former are the better predictors. We try to reproduce their success using a similar methodology. Our task is to predict for a given word whether its frequency (normalized by corpus size, hereafter relative frequency) will decrease or not in month n+1, given the information about month n. Not decreasing means that the frequency either increases or stays the same. We try six predictors: relative frequency, two measures of linguistic dissemination (the number of unique trigram contexts in which the word occurs and the number of unique part-of-speech trigram contexts) and three measures of social dissemination (the number of unique users, the number of unique threads and the number of unique subforums).

Using absolute numbers (e.g. the count of unique trigram contexts) as measures of dissemination can be problematic, since they are, obviously, all strongly correlated with absolute frequency. We follow the solution proposed by Stewart & Eisenstein (2018): for every month, we fit a linear regression model between the given predictor and the absolute frequency (for all words in the dataset) and then take the residuals (i.e. the part of the variation that is not explained by the absolute frequency) as a measure of dissemination. Hopefully, this procedure also mitigates the problem that corpus size varies strongly over time.

All our datapoints are tuples that look as follows: data for word A at month n, data for word A at month n+1. For every word, we randomly split all datapoints into a training and a test set (80:20). Since predictions are always based on the previous month only, there is no point in preferring a chronological split over a random one. We fit a logistic regression model on the training set and then evaluate its predictions on the test set by subtracting the baseline accuracy (achieved by always predicting the more frequent outcome in the training set) from the actual accuracy. The maximum possible value of model performance is thus 0.5, the minimum value is -1. We fit 18 different models: one of them is a null model (the predictor is a vector of ones), the rest are various combinations of the six aforementioned predictors that we find most promising. Our method is visualized in Figure 1 for the word dampa ‘to freak out’.

predictors                                  | rank n-s | rank n-p | perf. n-s | perf. n-p
1 (null model)                              | 18       | 18       | 0.007     | -0.030
rel. freq.                                  | 3        | 9        | 0.121     | 0.030
authors + threads + subforums + rel. freq.  | 1        | 2        | 0.134     | 0.045
authors + threads + subforums + rel. freq.
  + trigrams + pos trigrams                 | 2        | 1        | 0.124     | 0.054

Table 1: Performance (increase in accuracy over baseline) across various models and across the two neologism types: n-s (new signifier) and n-p (new phenomenon). Rank shows the place of the model in the list of all models ranked by performance.

4 Results and discussion

The performance (measured as the increase in accuracy over the baseline) varies substantially across models, but for all models it is better for the new-signifier words than for the new-phenomenon ones. The differences range from 0.037 to 0.088. For brevity’s sake, we report the performance of four models: the best one for new-signifier words; the best one for new-phenomenon words; the worst one, which is the same for both types (the null model); and, for comparison, the one which uses only relative frequency as a predictor. See Table 1. The null model performs worse than all other models, which is expected.
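The residualization of the dissemination measures and the baseline-subtracted evaluation described above can be sketched in pure Python as follows. This is a minimal sketch under our own naming, not the authors' implementation; the actual study fits logistic regression models, whose predictions are here simply passed in as a list.

```python
def residualize(x, y):
    """Residuals of an ordinary least-squares fit of y on x:
    the part of y not explained (linearly) by x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var = sum((xi - mx) ** 2 for xi in x)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = cov / var
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def performance(predicted, gold, train_gold):
    """Test-set accuracy minus the majority-class baseline accuracy
    (baseline: always predict the training set's most frequent outcome)."""
    majority = max(set(train_gold), key=train_gold.count)
    acc = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    base = sum(g == majority for g in gold) / len(gold)
    return acc - base
```

Per month, `residualize` would be applied with `x` as the absolute frequencies of all words and `y` as the raw dissemination counts, yielding frequency-adjusted dissemination scores.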
It is interesting to see, however, that even for this model there is a difference between the new-phenomenon words and the new-signifier words, which suggests that the difference in performance depends not so much on the predictors that we choose, but rather on certain properties of the trends themselves. Another interesting result is that frequency alone can account for a substantial part of the predictions’ success. While it is tempting to draw further conclusions about the relative importance of different predictors and to compare them with previous work (Stewart & Eisenstein, 2018; Würschinger, 2021), we refrain from doing so at this exploratory stage of our study. Additional pilot analyses (not reported here) suggest that small changes in the experimental setup have the potential to strongly affect how the models perform relative to each other, implying that these results may not be robust. The main result (better performance for new-signifier words), however, remains robust.

In order to test whether the difference in performance between the new-signifier words and the new-phenomenon words is an artifact, we perform two comparisons. First, we check whether the proportion of increases in frequency is approximately the same for the two types, and that turns out to be the case (0.56 for new phenomena, 0.54 for new signifiers). Second, we compare the average number of datapoints (months) per word. This number varies (and is smaller the later a word appears). It is reasonable to assume that with more datapoints the performance is likely to go up, and indeed that seems to be the case: there is a positive correlation between the number of datapoints per word and the performance of the best model on that word (Pearson’s coefficient is 0.52 for new phenomena and 0.58 for new signifiers). The average number of months per word is slightly smaller for new phenomena (193 vs. 218).
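The correlation check above uses Pearson's coefficient, which can be computed as follows. This is a generic pure-Python sketch; the datapoint counts and performance scores in the example are invented, not the study's data.

```python
import math

def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative only: months of data per word vs. best-model performance.
r = pearson([23, 100, 150, 220], [0.01, 0.08, 0.10, 0.15])
```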
If we remove the two words with the smallest number of datapoints (covid and vaccinpass, with 23 and 19 datapoints, respectively), the average for new phenomena goes up to 202, and the qualitative results do not change. If we remove the six words with the largest numbers of datapoints from the new signifiers and the six with the smallest from the new phenomena, the averages become equal, but the results still hold for all models but one. If we exclude the 11 words which we found particularly difficult to label as either new phenomena or new signifiers (marked with asterisks in Appendix A), the qualitative results do not change.

Our next goal is to test further which of the findings are robust, and for those that are, to explain why these effects emerge. We hope to do that, but who knows –

Acknowledgements

This work was supported by the Cassandra project (Marcus and Amalia Wallenberg Foundation, 2020.0060) and by the Swedish national research infrastructure Nationella språkbanken, funded jointly by the Swedish Research Council (2018–2024, contract 2017-00626) and the 10 participating partner institutions.

References

Lars Borin, Markus Forsberg, & Johan Roxendal. 2012. Korp – the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 474–478. Istanbul: ELRA.

Lars Borin, Markus Forsberg, & Lennart Lönngren. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47:1191–1211.

Ola Karlsson. 2021. Lesserwisser, lårskav och läppstiftseffekt – presentation och problematisering av urvalskriterierna för den svenska nyordslistan. LexicoNordica, 28:101–120.

Ian Stewart & Jacob Eisenstein. 2018. Making “fetch” happen: The influence of social and linguistic context on nonstandard word growth and decline. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4360–4370, Brussels. Association for Computational Linguistics.

Quirin Würschinger. 2021. Social networks of lexical innovation.
Investigating the social dynamics of diffusion of neologisms on Twitter. Frontiers in Artificial Intelligence, 4.

Appendix A

#  | word | translation | freq | year | type
1  | blogg | blog | 462799 | 2004 | phenomenon
2  | googla | [to] google | 445681 | 2003 | phenomenon
3  | app | app | 221612 | 2010 | phenomenon
4  | covid † | covid | 144059 | 2020 | phenomenon
5  | foliehatt | tin foil hat | 49073 | 2011 | signifier *
6  | buda | [to] bid | 33745 | 2006 | signifier
7  | svininfluensa | swine flu | 33240 | 2009 | phenomenon
8  | wiki | wiki | 32726 | 2007 | phenomenon
9  | blogga | [to] blog | 31606 | 2005 | phenomenon
10 | stalker | stalker | 27829 | 2003 | signifier
11 | följare | online follower | 24638 | 2009 | phenomenon
12 | curla | [to] be a helicopter parent | 22126 | 2006 | signifier
13 | twittra, kvittra † | [to] tweet | 21159 | 2009 | phenomenon
14 | bloggare | blogger | 18547 | 2005 | phenomenon
15 | lockdown | lock down | 18246 | 2020 | signifier *
16 | prio | priority | 18180 | 2006 | signifier
17 | padda | tablet computer | 18161 | 2011 | phenomenon
18 | näthat | online hate | 17655 | 2007 | phenomenon
19 | haffa | [to] pick up (while dating) | 17253 | 2015 | signifier
20 | hypa | [to] hype | 17050 | 2013 | signifier
21 | #metoo, metoo | #metoo | 15884 | 2017 | phenomenon
22 | stalka | [to] stalk | 15430 | 2003 | signifier
23 | matkasse | online groceries | 14352 | 2011 | phenomenon
24 | menskopp | menstrual cup | 13003 | 2005 | phenomenon *
25 | sprita | [to] disinfect hands | 11913 | 2009 | signifier *
26 | fronta | [to] be/put in front | 11541 | 2004 | signifier
27 | vaccinpass | vaccination certificate | 11480 | 2021 | phenomenon
28 | incel | incel | 9925 | 2018 | phenomenon
29 | brexit | brexit | 8532 | 2013 | phenomenon
30 | anime | anime | 8333 | 2003 | phenomenon *
31 | barnvagnsbio † | cinema for parents with babies | 8170 | 2003 | phenomenon
32 | åsiktskorridor | opinion corridor | 7963 | 2014 | signifier
33 | stalking, stalkning | stalking | 7930 | 2003 | signifier
34 | selfie | selfie | 7835 | 2013 | phenomenon
35 | digitalbox | digital tv box | 7620 | 2005 | phenomenon
36 | topsa | [to] swab | 7449 | 2004 | signifier *
37 | spikmatta | bed of nails | 7236 | 2009 | phenomenon *
38 | sporta | [to] sport, show off | 7067 | 2009 | signifier
39 | trängselskatt | congestion taxes | 6808 | 2005 | phenomenon
40 | pimpa | [to] pimp | 6714 | 2007 | signifier
41 | framåtlutad | energetic | 6391 | 2011 | signifier
42 | klimathot | climate change | 6366 | 2007 | phenomenon
43 | sars | SARS | 6309 | 2003 | phenomenon
44 | hedersvåld | honor-related violence | 6250 | 2005 | signifier
45 | entourage | entourage | 6120 | 2007 | signifier
46 | foppatoffel | crocs | 5799 | 2007 | phenomenon
47 | vintage | vintage | 5663 | 2007 | signifier
48 | EU-migrant | EU migrant | 5657 | 2015 | signifier *
49 | svinna | [to] waste | 5344 | 2021 | signifier *
50 | hbt | lgbt | 5068 | 2004 | signifier *
51 | halmdocka | straw man argument | 4878 | 2015 | signifier
52 | chippa | [to] chip | 4439 | 2009 | phenomenon
53 | transfett | trans fat | 4363 | 2007 | phenomenon *
54 | e-sport | e-sports | 4078 | 2013 | phenomenon
55 | snackis | hot conversation topic | 4002 | 2005 | signifier
56 | dampa | freak out | 3624 | 2007 | signifier
57 | transponder | transponder | 3612 | 2005 | phenomenon
58 | rondellhund | roundabout dog | 3362 | 2006 | phenomenon
59 | instegsjobb | entry-level job | 3336 | 2004 | phenomenon
60 | bröllopsklänning | wedding dress | 3275 | 2011 | signifier
61 | trollfabrik | troll factory | 3210 | 2015 | phenomenon
62 | kubtest | prenatal test | 3098 | 2007 | phenomenon
63 | videosamtal | video call | 2989 | 2004 | phenomenon
64 | cringe | cringe | 2955 | 2017 | signifier
65 | svischa † | [to] transfer money via Swish | 2923 | 2015 | phenomenon
66 | curlingförälder | helicopter parent | 2914 | 2004 | signifier
67 | youtuber | youtuber | 2878 | 2015 | phenomenon
68 | nätpoker | online poker | 2850 | 2005 | phenomenon
69 | blingbling | bling-bling | 2659 | 2004 | signifier
70 | nystartsjobb | entry-level job | 2599 | 2006 | phenomenon
71 | klimatsmart | climate friendly | 2558 | 2007 | phenomenon
72 | livspussel | work-life balance | 2496 | 2007 | signifier
73 | killgissa | [to] guess, mansplain | 2356 | 2017 | signifier
74 | backslick | backslick hairdo | 2321 | 2004 | signifier
75 | skypa, skajpa | [to] skype | 2306 | 2007 | phenomenon

Table 2: The 75 selected words with their year of appearance in Språkrådet/Språktidningen’s list, their frequency in the Familjeliv and Flashback discussion forum corpora, and approximate translations. The words are marked as either new signifier or new phenomenon. An asterisk marks borderline cases.
A dagger indicates that some spellings or other variants of the word that we included in the search are not listed in the table.

avokado-r/-er/-s/-sar

Kristian Blensenius and Louise Holmer
Department of Swedish, Multilingualism, Language Technology
University of Gothenburg, Sweden
kristian.blensenius@gu.se, louise.holmer@svenska.gu.se

Abstract

The article discusses lexicographic perspectives on the Swedish plural with the suffix -s. Traditionally, plural nouns ending in -s, for example avokados ‘avocados’, are considered colloquial; the formal way of writing the plural in question is avokador or avokadoer. However, since the Swedish Academy grammar included a noun declension indicating plurals with the suffix -s, the plural with -s seems to have become more accepted, at least among language planners.

1 Introduction

When Svenska skrivregler appeared in a new edition in 2017 (Karlsson, 2017), one widely noted novelty was that Språkrådet now approved of the s-plural – that the s-plural had, so to speak, become official. On Sveriges Radio’s website, Vetenskapsradions nyheter reported on 22 March 2017 that “[the Swedish writing rules] open up […] for the use of the plural -s in Swedish”, and on 27 March 2017 Svenska Dagbladet published an article headlined “English plural -s on its way into Swedish dictionaries”. The change compared to the previous edition (Svenska skrivregler, 2008) was not dramatic, but formulations to the effect that the foreign s-plural was “unsuitable” had been removed. It was still stated, however, that it is generally best to avoid the s-plural in formal texts.

Even though the s-plural has been used in Swedish since the 18th century (in some cases earlier; see Söderberg, 1983), the phenomenon still does not quite seem to have been welcomed in from the cold. When Svensk ordbok utgiven av Svenska Akademien was published in a revised edition in 2021 (SO, 2021), it nevertheless contained more words with -s plurals than before, e.g.
sambo, with the plural indication “sambos or sambor” (previously only sambor). The s-plural was also given as the first-listed form for some newly added words, e.g. hashtags or hashtaggar.

In Svenska Akademiens ordlista, 14th edition, from 2015 (SAOL 14), the situation is different: here the (English) s-plural is explicitly opposed, among other things on the grounds that this plural form does not fit into the Swedish inflectional system (see the introduction to the printed SAOL 14, p. XI). For a word like sambo, SAOL gives only the r-plural sambor, while another (non-English) word like avokado is given only the plural avokador (in addition to only the k-spelling being given). This despite the fact that plural forms like avokados are nowadays not unusual in spoken and written language, as is – though more rarely – avokadosar. The latter, which sometimes goes under the name sar-plural (Josefsson, 2018) and is treated briefly in Svenska Akademiens grammatik (SAG; Teleman et al., 1999, vol. 2, pp. 83, 104), occurs in forms such as bikinisar and, perhaps in more playful contexts, in triply marked plurals such as paparazzisar (the recommended form is paparazzoer). Language planners usually advise against the -sar plural in careful written language, and triply marked forms in particular have been kept out in the cold. A parallel case is the Italianism putto, for which the plural form puttisar is described by Wellander (1970, p. 162) as a word form that “can hardly be considered an ornament to our language”.

As part of the further development of SAOL and SO, we examine how s-plural forms are treated in the two Swedish monolingual dictionaries SAOL 14 and SO 2021, and in the present text we give examples of loanwords where written language shows variation in plural inflection, above all with regard to the s-plural.

Kristian Blensenius and Louise Holmer. 2022. avokado-r/-er/-s/-sar. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 13–16. Available under CC BY 4.0
We compare the recommendations expressed above all by Språkrådet, as well as the grammatical descriptions of the s-plural in SAG, with actual usage for a selection of words whose plural forms have previously been found to vary. We focus on the nouns avokado, bikini and hashtag. The first two have been well established in Swedish since at least the 1930s and 1940s, while hashtag is an example of a newer word, attested in writing (news text) since 2010.

2 Descriptions in grammars and language planning

This section reviews what is said about the s-plural in the reference grammar SAG, in some grammar textbooks, and in Språkrådet’s recommendations.

When SAG appeared in 1999, a seventh declension was added to a Swedish grammatical description (SAG 2, p. 63), comprising nouns with the plural suffix -s, e.g. dissenters and tricks. The fact that a comprehensive reference grammar of Swedish includes this declension has, from the perspective of language planning, come to lend a certain legitimacy to the use of the s-plural for words like avokado, bikini and hashtag: avokados, bikinis and hashtags, respectively.

The s-plural has, however, not automatically gained ground in university-level grammar textbooks since the publication of SAG in 1999: some have adopted the s-plural declension, e.g. Bolander (2012, p. 114), while others do not appear to treat it specifically, e.g. Lundin (2014, p. 187). Svenska Akademiens språklära (Hultman, 2003, pp. 64–65), which is relatively closely linked to SAG, assumes only six declensions, and although the s-plural is mentioned, no separate declension is assumed for it. Josefsson (2009) also discusses the s-plural, here as a possible sixth declension (Josefsson assumes five basic declensions), but at the same time expresses hesitation, reasoning that the s-plural occurs only during a transitional period, after which words come to be inflected according to one of the other declensions.

The latest edition of Svenska skrivregler can be said to go a step or two further than SAG as regards how established the plural form is taken to be. SAG (2, p.
79) notes that “The use of -s has in most cases a relatively un-Swedish character”, whereas Svenska skrivregler (Karlsson, 2017, p. 104) holds that “The s-plural is quite common in Swedish” and that “For some words […] the s-plural is so established that it, at least in some contexts, completely dominates over other inflectional patterns”. Språkrådet, in its online “Frågelådan”, goes somewhat further still: exemplifying with the plural of some Swedish words, e.g. sambos and jocular forms such as snyggos, gubbs and kvinns, it states that the s-plural is to be regarded as part of the Swedish language system.1

Svenska skrivregler does, admittedly, put forward a common argument against the s-plural: the inflectional pattern “lacks an established form for the definite plural” (Karlsson, 2017, p. 104). The idea seems to be that the s-plural declension should be kept intact (the s-suffix of the indefinite plural should also carry over into the definite form) and that the sar-plural should be avoided, at least in formal text. Similar reasoning can be discerned in language planners’ recommendations concerning other plural suffixes, of the type flera containrar – de containrarna.

3 The s-plural in SAOL, SO and actual usage

The noun hashtag has, as mentioned, a very limited history in SAOL and SO, but avokado and bikini have been included for far longer and have at times been supplied with different recommendations concerning inflection type and inflected forms over time (cf. Josefsson, 2009).

When the word avokado was included in SAOL 10 (1973), the only plural given was avokador, alongside the variant spelling avokato. The following editions also give the plural form avokadoer, whereas the s-plural avokados, as mentioned, has still not been taken into SAOL. The headword bikini was likewise included in SAOL 10 (1973), then with the plural bikini, i.e. the same form as the singular, and SAOL 11 (1986) also listed the variant inflection bikinier. In SAOL 14 (2015) the plural form bikini has disappeared, replaced by the inflection note “bikinier hellre än bikinis”, where the label “hellre än” (‘rather than’) indicates that bikinier is preferred over bikinis.
While SAOL 14 is more normative, SO 2021 is more descriptive (Blensenius et al., 2021, p. 41). The different orientations with respect to normativity show in SAOL 14 having more, and more explicit, recommendations than SO 2021 (e.g. the recommendation for attachment: “Use bilaga instead”). It is therefore no surprise that the more descriptively oriented SO 2021 gives an s-plural for avokado, avokados, while the more normative SAOL 14 does not recommend avokados. It may be noted that neither SAOL nor SO suggests, or in any other way mentions, the sar-plural as a possible form.

The plural forms of avokado, bikini and hashtag are given as follows in the two dictionaries: SAOL 14 gives “avokador”, “bikinier hellre än bikinis” and “hashtaggar hellre än hashtags”, whereas SO 2021 gives “avokados eller avokador”, “bikinis” and “hashtags eller hashtaggar”. Note that SO 2021 places the inflectional variants on an equal footing with the label “eller” (‘or’).

In written language, the plural inflection of avokado varies more than in the dictionaries. In various types of texts, the following forms are the ones mainly encountered: avokado, avokados, avokador, avokadoer, avokadon and avokadosar. Of these, avokador is the most frequent in newspaper text, roughly 10 times more common than avokados (we disregard the k/c variation, as well as potential homography). For bikini the plural variation is not as great, but there is no doubt that the plural form bikinis is considerably more common than bikinier, the form recommended in SAOL 14. For hashtag, hashtags is the more common plural in the indefinite form, while in the definite form hashtaggarna clearly dominates over e.g. hashtagsen in newspaper text.

1 Språkrådet, Frågelådan. “Hur ser Språkrådet på s-plural?” https://frageladan.isof.se/visasvar.py?svar=79712. Retrieved September 2022.

4 ‘Scissored’ inflection, or the freedom to move between paradigms

There is much to suggest that both plural forms of words like hashtag (-s and -ar) should be included in the dictionaries.
Language users can also solve the difficulty with the definite plural by using hashtags in the indefinite form and hashtaggarna in the definite. We propose allowing so-called saxad böjning ‘scissored inflection’ (SAG 2, p. 544), i.e. inflection where the forms of one and the same noun may be assigned to different declensions. One could then imagine the following inflection:

singular | plural indef. | plural def.
avokado  | avokados      | avokadorna
bikini   | bikinis       | bikinierna
hashtag  | hashtags      | hashtaggarna

In summary, we have tried to illustrate how the plurals are treated in partly different ways in grammars and dictionaries, and by language planners and language users. After all, it is language in use that forms the basis of both the dictionaries and the grammar books, as well as of the various recommendations of language planning, and the question is whether SAOL does not need to adopt a more permissive attitude to the s-plural in coming editions.

References

Kristian Blensenius, Louise Holmer, & Emma Sköldberg. 2021. SAOL 14 som rättesnöre – diskussion kring den senaste upplagan. LexicoNordica, pages 39–58.

Maria Bolander. 2012. Funktionell svensk grammatik (3rd edition). Liber, Stockholm.

Tor G. Hultman. 2003. Svenska Akademiens språklära. Svenska Akademien, Stockholm.

Gunlög Josefsson. 2009. Svensk universitetsgrammatik för nybörjare (2nd edition). Studentlitteratur, Lund.

Gunlög Josefsson. 2018. Avokadosar och kepsar – ett epentetiskt s med olika funktioner. Språk och stil, 28:5–21.

Ola Karlsson, editor. 2017. Svenska skrivregler (4th edition). Språkrådet & Liber, Stockholm.

Katarina Lundin. 2014. Tala om språk. Grammatik för lärarstuderande (2nd edition). Studentlitteratur, Lund.

SAOL 14. 2015. Svenska Akademiens ordlista över svenska språket (14th edition). Available at svenska.se. Retrieved September 2022.

SO. 2021. Svensk ordbok utgiven av Svenska Akademien (2nd edition). Available at svenska.se. Retrieved September 2022.

Barbro Söderberg. 1983. Från rytters och cowboys till tjuvstrykers. S-pluralen i svenskan. En studie i språklig interferens.
Almqvist & Wiksell International, Stockholm.

Ulf Teleman, Erik Andersson, & Staffan Hellberg. 1999. Svenska Akademiens grammatik. Svenska Akademien, Stockholm.

Svenska skrivregler. 2008. (3rd edition). Språkrådet & Liber, Stockholm.

Erik Wellander. 1970. Riktig svenska. En handledning i svenska språkets vård. Norstedts, Stockholm.

Counting dirty words: The effect of OCR quality on token statistics in historical Swedish corpora

Gerlof Bouma and Yvonne Adesam
Språkbanken Text / Dept of Swedish, Multilingualism, Language Technology
University of Gothenburg, Sweden
{gerlof.bouma,yvonne.adesam}@gu.se

Abstract

We explore the effects of varying OCR quality on word statistics in historical Swedish newspaper corpora. Most type frequencies are underestimated, with a small group of overestimated types. To adjust the frequencies we propose using the strong correlation between lexicon coverage and word error rate, which shows encouraging first results. The method currently, however, only targets underestimation and needs further development.

1 Introduction and background

The large-scale corpora provided through Språkbanken Text invite exploration of a wide range of questions, both linguistic and from other disciplines. One fruitful method that Språkbanken’s corpus infrastructure makes easily available to researchers is the comparison of term occurrences between different corpora, for instance between collections from different time periods, from different genres, with different target audiences, or with different types of text producers.

A valid comparison across corpora relies on the assumption that contrasts between corpora are due to factors inherent to the texts, preferably the factors of interest. However, different corpora have different levels of reliability and quality. First, there may be problems with the (automatic) annotations.
For instance, an early-2000s newspaper corpus is likely to receive better annotations than a 19th century newspaper or a 2020 blog entry, as the latter are textually further away from the material used to train the annotation tools. Secondly, the faithfulness of the available electronic version to its source may be compromised. We can expect a corpus based on born-digital newspapers from the 2000s to have few issues in this respect, but one based on material from before the digital age, which must be digitized – photographed, analysed for layout and text flow, and put through optical character recognition – may deviate from its source in many ways, and also contain errors that percolate down the processing pipeline. For ever-earlier historical newspapers, these issues will be exacerbated by the state/quality of the paper, the printing techniques, the script, and changes in conventions and language, to name but a few factors.

The impact of varying OCR quality on text analysis and processing has been studied by Hill & Hengchen (2019) and van Strien et al. (2020), among others. The current paper focuses on one specific aspect of this impact: the effect of OCR accuracy on word statistics. Suppose we look for the word politik ‘politics’ in a corpus of 1M tokens, and find 100 occurrences. We would say it has a relative frequency of 0.01%. But if we also know that half of these 1M tokens are letter salad – garbled words – we might prefer to say we found 100 occurrences in 500k tokens instead – twice the relative frequency. Below, we will have an empirical look at the relation between word frequencies and OCR quality. Furthermore, by using the proportion of known words in a document as a proxy for its OCR quality (Adesam et al., 2019; van Strien et al., 2020; Neudecker et al., 2021, and references therein), we will investigate a simple method for adjusting frequency estimates to correct for the error introduced by OCR mistakes.

Gerlof Bouma and Yvonne Adesam. 2022.
Counting dirty words: The effect of OCR quality on token statistics in historical Swedish corpora. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 17–24. Available under CC BY 4.0

2 Material and method

As our corpus, we use the historical newspaper dataset described in Dannélls et al. (2021), available from Språkbanken Text,1 which contains 186 pages of newspaper text, distributed over 90 documents (that is, newspaper editions) published between 1818 and 1906, with an additional late 20th century newspaper included for control. The data is a selection from the National Library of Sweden’s digital newspaper archive,2 and consists of images, ground truth transcriptions (henceforth: GT) and the output of several OCR systems. We use the included output of the commercial ABBYY Finereader 11 as our OCR text. We visually classified newspaper editions as Blackletter, Antiqua or mixed3 using the supplied images. Blackletter is the dominant script in 44 documents, Antiqua in 38, and 8 documents are mixed.

We lightly preprocessed both the GT and the OCR material by (blindly) undoing end-of-line hyphenation and replacing ſ (long s) by s, after which we ran the materials through the Sparv annotation pipeline using an annotation module for historical Swedish4 (Hammarstedt et al., 2022). Sparv provides us with tokenization and links from word forms to entries in one of three lexical resources: one consisting of entries (base forms) in Swedberg’s early 18th c. dictionary (Swedberg & Holm, 2009), a resource based upon Dalin’s early 19th c. dictionary (Dalin, 1850–1853) with full paradigms for part of the entries,5 and the present-day Swedish lexical resource Saldo (Borin et al., 2013), which contains a comprehensive full-form component. As we were not interested in the quality of the links, but in estimating the maximal coverage of the dictionaries, we configured Sparv to assign as many links as possible.
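The frequency-adjustment idea from the introduction – normalizing by the estimated number of well-recognized tokens rather than by the raw token count – can be sketched as follows. This is our own minimal sketch, not the authors' method; `prop_known` stands in for the proportion of tokens that received a dictionary link in the Sparv annotation.

```python
def relative_frequency(count, n_tokens):
    """Plain relative frequency: occurrences per token."""
    return count / n_tokens

def adjusted_relative_frequency(count, n_tokens, prop_known):
    """Normalize by the estimated number of well-recognized tokens.
    prop_known is the proportion of tokens linked to a dictionary
    entry, used as a proxy for OCR quality."""
    return count / (n_tokens * prop_known)

# The politik example from the introduction: 100 hits in 1M tokens,
# of which only half are estimated to be real words.
plain = relative_frequency(100, 1_000_000)
adjusted = adjusted_relative_frequency(100, 1_000_000, 0.5)
```

With half the tokens discounted as garbled, the adjusted relative frequency is exactly twice the plain one.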
We also separately postprocessed tokens containing ß, since the earliest dictionary contains this ligature as is, whereas the later ones use ss in its place. For our statistics we only consider word-like tokens, and therefore discard punctuation.6 The GT and OCR materials each consist of around 500k words.

One of the quantities of interest in this study is the proportion of known word-like tokens, that is, those which received at least one dictionary link. Existing research (van Strien et al., 2020) shows that this is correlated with OCR quality. Since we can add the needed annotation automatically, the proportion of known words gives us a handle on OCR quality without the need for demanding manual transcription. To evaluate the validity of this proxy, we also perform an intrinsic evaluation of OCR quality by comparing GT and OCRed documents at word level, using normalized word error rate (NWER). This measure is based upon the alignment of token sequences, in our case sequences forming a paragraph-like segment, as defined in the dataset we use. If we have four alignment operations – insert a word, delete a word, replace a word by another word, and match a word to an identical one – NWER is given by:

NWER = (# error operations) / (# all operations)
     = (# insertions + # deletions + # replacements) / (# insertions + # deletions + # replacements + # matches)

We ignore differences in case and treat ß and ss as the same substring when comparing two words.

3 Results

3.1 OCR quality

Figure 1 contains the results of looking at OCR quality over time, both using NWER (a) and using the proportion of known tokens (b).7 The NWER results clearly show that later editions are OCRed more accurately, although, curiously, the effect is only really seen in the Blackletter material. The error rate in the

1 https://spraakbanken.gu.se/resurser/svenska-tidningar-1818-1870, https://spraakbanken.gu.se/resurser/svenska-tidningar-1871-1906
2 The archive is accessible from tidningar.kb.se.
The historical part is available as the “Kubhist” corpus at https://spraakbanken.gu.se/korp/?mode=kubhist
3 We classified as mixed newspaper samples that contain sections in Blackletter as well as sections in Antiqua, for instance when the former script was used for news and the latter for classified advertisements or a feuilleton. Older Blackletter newspapers also frequently contain smaller stretches of Antiqua, for instance for French or Latin words/text, but we did not consider this to be mixed script material.
4 https://spraakbanken.gu.se/sparv/
5 Notably, however, there are no inflected forms for some irregular high frequency items, such as the verb vara ‘be’.
6 To be precise, we only look at tokens matching the regular expression \w+([-:']\w+)*, where \w is an alphabetical character or a number.
7 The smoothed trendlines are produced by the loess function in R v4.2.1, with the standard settings (R Core Team, 2022).

Figure 1: The relation between publication date of the newspaper and different aspects of OCR quality: (a) OCR quality by publication date, (b) coverage of dictionaries by publication date in OCRed data, (c) coverage of dictionaries by publication date in ground truth data. Intrinsic evaluation using GT data shows newer publications are interpreted more accurately (a), although the effect is only clear for publications using Blackletter. Concordantly, newer material generally contains higher proportions of words recognized by the dictionaries (b), although comparison to measurements on ground truth material reveals that part of this upward trend is explained by the better compatibility of the large present-day dictionary with newer language material (c).
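The NWER measure used for the intrinsic evaluation can be sketched with Python's standard difflib. This is a sketch, not the authors' implementation: difflib's alignment is not guaranteed to be edit-distance-minimal, and the token normalization follows the description in section 2 (case-insensitive, ß treated as ss).

```python
from difflib import SequenceMatcher

def normalize(token):
    # Ignore case differences and treat ß and ss as the same substring.
    return token.lower().replace("ß", "ss")

def nwer(gt_tokens, ocr_tokens):
    """Normalized word error rate:
    (ins + del + repl) / (ins + del + repl + matches)."""
    gt = [normalize(t) for t in gt_tokens]
    ocr = [normalize(t) for t in ocr_tokens]
    ins = dels = repl = match = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=gt, b=ocr,
                                               autojunk=False).get_opcodes():
        if tag == "equal":
            match += i2 - i1
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
        else:  # replace: pair up as many tokens as possible, rest are ins/del
            n_gt, n_ocr = i2 - i1, j2 - j1
            repl += min(n_gt, n_ocr)
            dels += max(0, n_gt - n_ocr)
            ins += max(0, n_ocr - n_gt)
    return (ins + dels + repl) / (ins + dels + repl + match)
```

In the study this is computed per paragraph-like segment; the sketch takes any pair of token sequences.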
Figure 2: Dictionary coverage is highly predictive of OCR quality (r = –0.925).

Antiqua material is comparatively stable from the earliest newspapers up to the 20th century control edition. The proportion of known tokens gives a similar, though mirrored, trend. The mirroring arises because a high proportion of known tokens goes together with a low error rate. But given this transformation, the overall shapes of the graphs are very similar, including the wavy area around 1860. Generally, later material is of better quality.

The relation between NWER and the proportion of known tokens is plotted explicitly in Figure 2. The high correlation shows that probing OCR quality using dictionary coverage is feasible. We do note that the proportion of known tokens falls more slowly than the error ratio increases.

Figure 1c shows that the proportion of known tokens also grows over time for the GT data (green points and trend line). The data set’s newer material is more like the present-day Swedish we find in Saldo’s full form lexicon. This explains why we see the coverage improvement in the graph for the modern dictionary (in blue) but not for the historical dictionaries (yellow). The trend in subfigure (b) therefore reflects OCR quality as well as the compatibility of the annotation tool and dictionaries with the language in the corpus.

3.2 Word statistics

As can be seen in the “Total” rows in Table 1, the GT material contains just over 70k types (“observed in GT”), of which almost 27k are observed more than once (“repeated in GT”). Of these repeated types, 48% have the same token counts in the OCR data as in the GT data (correctly estimated), 44% have lower counts in the OCR data (underestimated), and 7% higher counts (overestimated).
For almost half of the types in this part of the vocabulary, then, we see a loss of token mass. The table also shows that the OCR data contains an additional 37k types (for a total of 47k tokens), absent from the GT vocabulary.
The ten most frequent underestimated word types are och ‘and’, af ‘of’, att (complementizer/infinitive marker), till ‘to’, den ‘it’, en ‘a/one’, för ‘because/(be)for’, som (relativizer), med ‘with’, and på ‘on’. These appear 4.5k–15.5k times in the GT data, but generally lose around 7% of their token mass in the OCR data. The outlier here is på, which has 42% lower counts, which suggests it is easily misidentified. Indeed, we find the lost token mass with other forms, sometimes non-words like pa (counts inflated from 4 in GT to 152 in OCR) and pä (from 7 to 1577).8
8 The existence of these non-words in the GT data shows that this material also contains errors, although in these cases we do not know whether these were misprints or mistakes in the manual transcription.

                           GT                        OCR
GT freq. region            Freq. range   #Tokens    Freq. range   #Tokens    #Types
Top third                  15369 – 629   161703     13834 – 367   149895     69
Middle third               628 – 24      161597     781 – 0       149023     2002
Bottom third               23 – 0        162726     1577 – 0      192484     105097
 — observed in GT          23 – 1        162726     1577 – 0      145909     68312
 — repeated in GT          23 – 2        119096     1577 – 0      113297     24682
Total (≥ 0)                              486026                   491402     107168
 — observed in GT (≥ 1)                  486026                   444827     70383
 — repeated in GT (≥ 2)                  442396                   412215     26753

Table 1: Summary of type and token counts, and definition of three vocabulary regions on the basis of GT frequencies.
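The three-way classification of GT-observed types can be sketched as follows (a toy illustration; the counts are invented and the function name is our own):

```python
from collections import Counter

def classify_types(gt_counts, ocr_counts):
    """Split GT-observed types by whether the OCR data under-, over-
    or correctly estimates their token counts."""
    under, over, correct = [], [], []
    for word, gt in gt_counts.items():
        ocr = ocr_counts.get(word, 0)
        if ocr < gt:
            under.append(word)
        elif ocr > gt:
            over.append(word)
        else:
            correct.append(word)
    return under, over, correct

# Toy example: "på" loses token mass to the non-word "pa".
gt = Counter({"och": 100, "på": 50, "pa": 1})
ocr = Counter({"och": 100, "på": 29, "pa": 12})
print(classify_types(gt, ocr))  # (['på'], ['pa'], ['och'])
```

Run over the full vocabulary, the relative sizes of the three lists give the 44%/7%/48% split reported above.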
Figure 3: For most of the types that make up the central third in terms of token mass, frequencies based on OCR underestimate true frequencies. Overestimations happen typically in the context of single letter types and short words. (Scatter plot of token count in OCR against token count in ground truth; r = 0.923.)

The most frequent overestimated types are i ‘in’ (also: letter), är ‘am/are/is’, 1, vid ‘at’, a (letter), 3, var ‘was/were/where’, c (letter), 4, and g (letter). As can be seen, these types are very short, which is not just due to them being high frequency types. In general, the overestimated types are shorter (median of 5 characters for the types observed repeatedly in GT) than the correctly estimated or underestimated types (8 characters). This can be explained by the higher neighbourhood density of shorter words: misreading a short word has a chance of yielding another existing word, whereas a long word is more likely to give a nonsense type. That single character counts are inflated may have an additional cause: they can also be the result of segmentation errors, for instance when the OCR interprets a wide-spaced heading as made up of single letters rather than words. A precise investigation of these effects is beyond the scope of this paper, but for now we note that the existence of both under- and overestimated types means that a single, simple adjustment to the OCR frequencies will not suffice.
Table 1 also divides the type vocabulary into three regions of roughly equal GT token mass. Note that the OCR frequency ranges may overlap. For instance, the extremely inflated counts of pä, discussed above, bring the upper limit of the OCR frequency range of the bottom third well within the top range.
The proportion of overestimated types differs starkly between the three regions: in the top third 17% are overestimated, in the middle third 11%, and in the GT observed part of the bottom third only 4% (7% in the repeated part). This must also be explained by the stronger tendency of shorter words to be overestimated, combined with the well known over-representation of short words among the most frequent words. Of course, this generalization breaks down once we include the types that haven’t been observed in the GT data: these are all by definition overestimated in the OCR data, and in addition they are on the long side (median of 8 characters).
The shapes of the frequency distributions in the GT and OCR data resemble each other closely in the overall data (Pearson’s r = 0.9911, only types observed in GT), as well as in the top and middle regions (r = 0.9919 and r = 0.9230, respectively). However, in the bottom third, the correlation is low (r = 0.3657, only types observed in GT).
Figure 3 plots the relation between the two frequency distributions in the middle segment, with illustrative cases of over- and underestimation highlighted. Above the diagonal we see the overestimated cases, which tend to be short in this segment, too. An outlier here is frän ‘pungent’, whose inflation is due to its resemblance to the highly frequent från ‘from’. The underestimated types below the diagonal are longer and many contain diacritics. An outlier here is no ‘number’ (that is, numero), which appears in the newspapers as No or as the abbreviation №, which the OCR software cannot handle.

3.3 Towards a method for adjusting token counts

Although the shapes of the frequency distributions are similar, OCR-based frequency estimates are overall too low for the GT observable part of the vocabulary, on average 18.6% lower. We will briefly consider two ways to adjust the OCR estimates upwards to lie closer to the GT observations.
In general, we do not know the amount of underestimation, so we cannot use this number directly. However, we can try to guess it from the dictionary coverage numbers, plotted in Figure 1b. With a different correction factor per document, the adjusted frequencies are

  \widehat{counts}(w) = \sum_{doc} counts(w, doc) \times \frac{1}{coverage(doc)}

This method will overcompensate grossly, however, since the mean coverage in the OCR data is only 72%. Applying it, we get a total of 589364.4 tokens for the GT observable vocabulary (top third: 198659.5, middle: 197112.7, bottom: 193592.2), overshooting the mark by 100k tokens. Moreover, the correlation between GT and adjusted OCR frequencies is lower than with the raw OCR frequencies: r = 0.9899 (top: 0.9908, mid: 0.9140, bottom: 0.3445), which means that also in terms of shape we have moved away from our target distribution.
A better estimate uses the observation that dictionary coverage on OCR data is not only a matter of OCR quality, but also of compatibility between document language and the lexical resource. A more conservative adjustment would use the latter component as an upper baseline for coverage, as in

  \widehat{counts}(w) = \sum_{doc} counts(w, doc) \times \frac{compatibility(doc)}{coverage(doc)}

In a real-world setting, we do not know the compatibility of each specific document with a lexical resource, but we may be able to guess it from what we know about similar documents. We implement this idea here by using the smoothed coverage trend in the GT data – that is, the green trend line in Figure 1 (c) – to provide us with compatibility scores for each document.
The results are encouraging. We now arrive at 478461.6 tokens for the whole GT observable lexicon (top: 161200.3, mid: 160126.7, bottom: 157134.6), which is just 10k tokens under the target. Although the correlation with the GT distribution is still worse than for the unadjusted OCR frequencies, the situation is better than with our first adjustment: r = 0.9903 (top: 0.9911, mid: 0.9168, bottom: 0.3516).
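The two adjustments can be sketched as follows (an illustrative sketch with invented numbers; the function name and data layout are our own):

```python
def adjusted_counts(per_doc_counts, coverage, compatibility=None):
    """Adjust per-document OCR token counts upwards.

    per_doc_counts: {doc: {word: OCR count}}
    coverage:       {doc: proportion of tokens known to the dictionaries}
    compatibility:  {doc: estimated coverage ceiling given the document's
                    language}; if None, the factor is 1/coverage, i.e. the
                    first, grossly overcompensating variant.
    """
    totals = {}
    for doc, counts in per_doc_counts.items():
        comp = 1.0 if compatibility is None else compatibility[doc]
        factor = comp / coverage[doc]
        for word, count in counts.items():
            totals[word] = totals.get(word, 0.0) + count * factor
    return totals

# Toy data: one document with 72% dictionary coverage.
docs = {"d1": {"och": 72}}
print(adjusted_counts(docs, {"d1": 0.72}))               # boosts to ~100
print(adjusted_counts(docs, {"d1": 0.72}, {"d1": 0.9}))  # boosts to ~90
```

With compatibility scores taken from the smoothed GT coverage trend, the factor stays close to 1 for well-covered documents and only boosts counts where coverage falls below what the document's language would predict.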
4 Conclusions

In this paper, we have taken a first look at the effects of varying OCR quality on word statistics in historical Swedish newspaper corpora. The picture that emerges is that, overall, the OCR-based statistics stay close to the true distributions in terms of shape, but that the counts for the majority of types are underestimated. Against this trend we find a much smaller group of overestimated types, which tend to be short and highly frequent. Using individual examples, we provide some evidence that this is related to lexical neighbourhood effects, but a more rigorous investigation is needed to provide a more solid ground for this account.
We also propose a method to calculate adjusted frequencies, so that the OCR estimates resemble the true GT distributions better. Our method makes crucial use of the strong correlation between coverage and word error rate we found in the data set. An evaluation of its application to further materials would let us better assess this method. A particular drawback of the method is that it only targets underestimation, and it currently only worsens the overestimated type counts. We hope that a future better understanding of these overestimated cases will help us devise a method that addresses them specifically.

Acknowledgements

We are grateful to Martin Hammarstedt and Anne Schumacher for their help with the Sparv pipeline. We also thank Lars Borin for recognizing the value of connected corpus and lexical infrastructures, and consistently driving their development, not just for modern materials but also for historical corpora. The current paper is only a small demonstration of the many things one can do if one has both these resources available together.

References

Yvonne Adesam, Dana Dannélls, & Nina Tahmasebi. 2019. Exploring the quality of the digital historical newspaper archive Kubhist.
In Proceedings of the 4th Conference of The Association Digital Humanities in the Nordic Countries (DHN), Copenhagen, Denmark, March 5–8, 2019, edited by Costanza Navarretta, Manex Agirrezabal, Bente Maegaard. Aachen: CEUR Workshop Proceedings.
Lars Borin, Markus Forsberg, & Lennart Lönngren. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4):1191–1211.
Anders Fredrik Dalin. 1850/1853. Ordbok öfver svenska språket [Swedish dictionary]. Vol. I–II. Joh. Beckman, Stockholm.
Dana Dannélls, Lars Björk, Ove Dirdal, & Torsten Johansson. 2021. A two-OCR engine method for digitized Swedish newspapers. In Selected Papers from the CLARIN Annual Conference 2020, volume 180 of Linköping Electronic Conference Proceedings, pages 65–74. LiU Electronic Press.
Martin Hammarstedt, Anne Schumacher, Lars Borin, & Markus Forsberg. 2022. Sparv 5 user manual. Technical report, Department of Swedish, Multilingualism, Language Technology, University of Gothenburg.
Mark J. Hill & Simon Hengchen. 2019. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34(4):825–843.
Clemens Neudecker, Konstantin Baierer, Mike Gerber, Christian Clausner, Apostolos Antonacopoulos, & Stefan Pletschacher. 2021. A survey of OCR evaluation tools and metrics. In The 6th International Workshop on Historical Document Imaging and Processing, HIP ’21, pages 13–18, New York, NY, USA. Association for Computing Machinery.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Jesper Swedberg & Lars Holm. 2009. Swensk Ordabok. Utgiven efter Uppsala-handskriften, med tillägg och rättelser ur övriga handskrifter, av Lars Holm [Swedish dictionary. Published on the basis of the Uppsala manuscript, with additions and corrections from other manuscripts, by Lars Holm].
Stifts- och landsbiblioteket i Skara, Skara.
Daniel van Strien, Kaspar Beelen, Mariona Ardanuy, Kasra Hosseini, Barbara McGillivray, & Giovanni Colavizza. 2020. Assessing the impact of OCR quality on downstream NLP tasks. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence – Volume 1: ARTIDIGH, pages 484–496. INSTICC, SciTePress.

Investigating a linguistic mini landscape: The Tsez (Dido) dialect dictionary project

Bernard Comrie
Department of Linguistics
University of California
Santa Barbara, CA, USA
comrie@ucsb.edu

Abstract

The interim results of the Dialect Dictionary of the Tsez (Dido) Language project, in comparison with documentation of grammatical differences among Tsez dialects from the mid-twentieth century, provide good information on the dynamics of grammatical and phonological change in Tsez dialects over a period of 70 years. Morphology is stable, including morphological differences between dialects, and may be an important linguistic marker of local identity. By contrast, two major phonological changes, vowel shortening and delabialization of consonants, have radically changed the phonologies and morphophonologies of many Tsez dialects. These changes have spread widely, but not universally, giving rise to new isoglosses that distinguish dialects from one another. The mini-landscape overall shows an interesting interaction of stability and innovation, against the background of the maintenance of local identity, including in linguistic terms.

1 Introduction

Lars Borin and I share an interest in the dynamics of language areas, including both the synchronic distribution of typological variables within the area and the historical processes — break-up of proto-languages with increasing diversification of descendant languages across time, as well as the effects of language contact — that have led to the present-day distribution.
We have collaborated with Anju Saxena in investigating the large language area that is South Asia, as well as one of its mid-sized sub-components, the Western Himalayas. I happen to have a personal interest in a much smaller language area, namely the dialect diversity within the Tsez language.
Tsez, also known by the Georgian-origin exonym Dido, is one of about a dozen small languages spoken in the Tsunta and Tsumada districts of the Daghestan Republic in the North Caucasus. All belong to the Nakh-Daghestanian (East Caucasian) language family. According to the 2010 census of the Russian Federation, Tsez then had about 12 500 speakers, although community activists think that this is substantially undercounted. Tsez is spoken in about fifty small villages, and probably each village has some combination of grammatical and lexical features that sets it apart from each other village. However, the different village varieties can be grouped into a limited number of dialect clusters.
The most detailed published classification of the Tsez dialects remains Imnajšvili (1963, p.9–10), based on extensive fieldwork among the Tsez in the period 1946–1954. There is a clearcut distinction between the Sagada dialect group and the remainder of the dialects, which latter I will call the Nuclear Tsez dialect group; mutual intelligibility across this divide is impaired, while varieties within each of the Sagada and Nuclear Tsez groups are readily mutually intelligible. I restrict myself in this article to the Nuclear Tsez group. Imnajšvili divides the Nuclear Tsez group into five dialect clusters, as shown in Table 1, and this classification remains the basis for the map in Koryakov (2002). Figure 1 is a schematic visualization of the relative geographic location of the Tsez dialects based on Koryakov, with dialect clusters in bold face; where more than one village dialect within a dialect cluster is cited in this article, they are indicated in italics.
In the text of the article, references are to village dialects unless cluster or group is specified.

Bernard Comrie. 2022. Investigating a linguistic mini landscape: The Tsez (Dido) dialect dictionary project. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 25–28. Available under CC BY 4.0

Dialect group    Dialect cluster    Village dialects referred to in text
Sagada           Sagada             Sagada
Nuclear Tsez     Kidero             Kidero, Mokok
                 Shaitli            Shaitli
                 Asakh              Asakh, Khushet, Khutrakh, Tsebari
                 Shapikh            Shapikh
                 Elbok              Elbok

Table 1: Tsez dialects (classification)

Figure 1: Tsez dialects (sketch map)

Imnajšvili (1963) is a grammar, and as such concentrates on grammatical, including phonological, differences across dialects, with only incidental attention to lexical differences. The explosion of work on Tsez that started in the late 1980s and continues to the present day has seen continued investigation of the grammar of the language, in particular of the dialect groups Asakh (Tsebari village) and Kidero (Kidero and especially Mokok villages). There has also been a burgeoning of work on the lexicon, with the main published work to date being Xalilov (1999); this work does not aim to cover the whole range of Tsez dialects, but often gives variants for Kidero, Mokok, and Asakh. Ramazan Rajabov, from Tsebari, compiled unpublished lexical materials in his native dialect in his capacity as Research Assistant under NSF grant SBR-9220219 (University of Southern California; PIs Maria Polinsky, Bernard Comrie) in the early 1990s.
The most ambitious proposal so far to document cross-dialect diversity, more specifically lexical diversity, in Tsez is the project Dialect Dictionary of the Tsez (Dido) Language (Abdulaev & Xalilov, In prep).
By mid-2022 preliminary version 5 of the dictionary was ready, under the authorship of Arsen Abdulaev (from Mokok) and Madžid Xalilov (Head of the Lexicology and Lexicography Department in the Daghestan Federal Research Center of the Russian Academy of Sciences); my own role in the project is as a scientific advisor. The data collection phase of the project was funded by the then Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology. Comparison of this current project with dialect differences seen in Imnajšvili (1963) provides input into the discussion of section 2.
At first sight, however, one might wonder to what extent one can reasonably compare a grammar based on documentation from the mid-twentieth century with a contemporary dictionary. Imnajšvili (1963) does not deal specifically with the lexicon, while Abdulaev & Xalilov (In prep) do not deal specifically with grammar. However, there are two factors that lead to substantial comparability. First, Tsez has a reasonably complex inflectional morphology, and like many such languages has lexicalized some grammatical forms, so that these appear as separate lemmas in a dictionary. One might compare the English adjectives interesting and tired, distinct lexical items despite their etymologies as present participle of the verb to interest and past participle of to tire respectively. Second, an important point of comparison is the phonology, and here both the grammar and the lexicon provide relevant material, in particular given that the grammar gives examples of the inflectional morphology of a range of lexical items.

2 The dynamics of stability and change in Tsez dialects

One respect in which Nuclear Tsez dialects differ is in the details of morphology. For instance, negative suffixes in some dialects begin with simple č’, while in others they begin with nč’.
The negative converb thus has two dialect variants, in -č’ey and in -nč’ey, as in the forms of the verb +–iy1 ‘to know’: +–iy-č’ey and +–iy-nč’ey. This negative converb has become lexicalized in Tsez in the meaning ‘unknowingly, unbeknownst’ (i.e. either the subject of the sentence or some other entity may lack knowledge — Tsez is simply unspecific here), and as such it is listed in the dictionary. Imnajšvili (1963, p.199) shows the suffix variant with n for the Kidero, Shaitli, and Elbok dialect clusters, and the variant without n for the Asakh and Shapikh dialect clusters. Exactly this same distribution is found in Abdulaev & Xalilov (In prep) (s.v. рийнчIей). This seems to reflect a general tendency in Tsez dialect morphology. The dialects are morphologically conservative, retaining features that distinguish them from other dialects as stable markers of local identity.
Turning now to phonology, including morphophonology, two phenomena turn out to be not only of interest given recent developments in Tsez dialects but also well represented in the citation forms of lexical items in Abdulaev & Xalilov (In prep): long vowels and labialized consonants.
All Tsez dialects seem historically to have had long vowels, restricted to certain morphological forms and probably originally representing vowel lengthening with concomitant reduction in the number of qualitative oppositions. The material provided by Imnajšvili (1963) shows that long vowels were then largely intact across the dialects, although there were already some signs of shortening. Thus, Imnajšvili (1963, p.94) gives long-vowel forms for the dative case of the first person singular pronoun for Kidero and Mokok (dǟr), Asakh (dār), and Shaitli (dēr), but a form with a short vowel in Elbok (dár, where the acute accent indicates a stressed short vowel). Abdulaev & Xalilov (In prep) show that long vowels have undergone shortening across many dialects.
They are systematically retained in Mokok and Asakh, to which we can add Tsebari. They are systematically lost in Kidero, this being, incidentally, one of the features that now distinguishes Mokok from Kidero within Imnajšvili’s Kidero dialect cluster. The shortening of long vowels in Kidero was noted already in Kibrik & Kodzasov (1990, p.329). The dialect differences can be seen in a nominalized derivative of the verb gugi- ‘to be lost’. The past participle has the suffix -ru, which requires vowel lengthening, thus giving gāgiru, as still in Mokok and Asakh. This can then be nominalized with the suffix -ɬi to give gāgiruɬi ‘loss’, which is listed as a separate lexical item in Abdulaev & Xalilov (In prep) (s.v. гāгирулъи). The word for ‘loss’ is given there as gāgiruɬi, with a long vowel, in Mokok and Asakh, but as gagiruɬi, with a short vowel, in Kidero. Material from other village dialects is still being processed.
Historically, labialized consonants are attested in Tsez both in particular lexical items, e.g. kʷedin ‘sledgehammer’, including finally in some verb stems, e.g. caxʷ- ‘to write’, and as the result of desyllabification of the vowel u before another vowel in verbs, as in the infinitive +–ezʷ-a from the stem +–ezu- ‘to look’. I concentrate here on verb forms like infinitive caxʷ-a and +–ezʷ-a. As documented by Imnajšvili (1963, p.166), labialization was still found, apparently in all dialects, in both verb types (though with some indications of incipient loss in the form of sporadic variable labialization). By the time of Abdulaev & Xalilov (In prep), it had basically been lost in Kidero, Mokok, Shaitli, and Shapikh, consistently retained only in Asakh and Elbok, predominantly retained in Khushet and predominantly lost in Khutrakh — the “predominantly” here covering variation that probably indicates a sound change in progress.
In Tsebari, the situation is more differentiated: Labialization is retained consistently in the +–ezʷ-a type, but just as consistently lost in the caxʷ-a type. Compare the forms in Table 2. In the language of the period documented by Imnajšvili (1963), all dialects would have followed the pattern of Asakh village.
Some explanations of the forms are in order. First, the past witnessed has the suffix -si after a consonant, shortened to -s after a vowel; since labialized consonants are only found phonetically before a vowel, the labialization is lost in cax-si. Second, the infinitive has the suffix -a, before which a vowel is lost: The vowel i is simply dropped, while the vowel u is desyllabified to give labialization in dialects that retain labialization, and dropped in those that do not.
1 The notation +– at the beginning of a word indicates that the given word requires a gender agreement prefix in that morphological slot.

            Past        Infinitive                             Lemma in Abdulaev
Stem        witnessed   Mokok       Asakh       Tsebari        & Xalilov (In prep)   Gloss
+–ac’-      +–ac’-si    +–ac’-a     +–ac’-a     +–ac’-a        бацӏа                 ‘to eat’
+–ik’i-     +–ik’i-s    +–ik’-a     +–ik’-a     +–ik’-a        бикӏа                 ‘to go’
caxʷ-       cax-si      cax-a       caxʷ-a      cax-a          цаха                  ‘to write’
+–ezu-      +–ezu-s     +–ez-a      +–ezʷ-a     +–ezʷ-a        беза                  ‘to look’

Table 2: Labialization in Tsez verb forms

The phonological changes of vowel shortening and delabialization thus show innovations spreading across the landscape, though still sensitive to village dialect boundaries, which sometimes become isoglosses separating the innovative and conservative forms.

3 Conclusion

For the Tsez community, the main attraction of Abdulaev & Xalilov (In prep) is the rich information it provides on lexical differences across Tsez dialects. For linguists, however, the dictionary, even in its present preliminary stage, contains enough information on morphology and phonology to provide new insights into the dynamics of language stability and change across Tsez dialects since the mid-twentieth century.
In particular, morphological distinctions seem stable, perhaps as markers of local linguistic identity. By contrast, phonological (including morphophonological) changes, such as vowel shortening and delabialization of consonants, have spread rapidly, though they remain sensitive to dialect boundaries as potential isoglosses. It is to be hoped that more detailed studies of Tsez dialect morphology and phonology will follow.

References

Arsen K. Abdulaev & Madžid Š. Xalilov. (In prep.). Диалектологический словарь цезского (дидойского) языка [Dialect dictionary of the Tsez (Dido) language].
David S. Imnajšvili. 1963. Дидойский язык в сравнении с гинухским и хваршийским языками [The Dido language in comparison with Hinuq and Khwarshi]. Tbilisi: Izd-vo Akademii nauk Gruzinskoj SSR.
Aleksandr E. Kibrik & Sandro V. Kodzasov. 1990. Сопоставительное изучение дагестанских языков. Имя. Фонетика [Comparative study of Daghestanian languages: The noun. Phonetics]. Moscow: Izd-vo MGU.
Yuri Koryakov. 2002. Dagestanian languages: West. In Yuri Koryakov, editor, Atlas of the Caucasian languages: Map 9. Moscow: Institute of Linguistics RAS. http://lingvarium.org/maps/caucas/9-andido.gif.
Madžid Š. Xalilov. 1999. Цезско-русский словарь [Tsez-Russian dictionary]. Moscow: Academia.

Beyond strings of characters: Resources meet NLP – Again

Dana Dannélls
Språkbanken Text, University of Gothenburg, Sweden
dana.dannells@svenska.gu.se

Tiago Timponi Torrent
FrameNet Brasil, Federal University of Juiz de Fora, Brazil
tiago.torrent@ufjf.br

Natalia Sathler Sigiliano
FrameNet Brasil, Federal University of Juiz de Fora, Brazil
natalia.sigiliano@ufjf.br

Simon Dobnik
CLASP, FLoV, University of Gothenburg, Sweden
simon.dobnik@gu.se

Abstract

FrameNet (FN) resources have existed for many languages for over a decade but their adoption in real world applications has been limited.
To celebrate the 65th anniversary of Lars Borin, the initiator and leader of Swedish FrameNet, among others, we take a standpoint to motivate why language resources are crucial for moving NLP forward. We present our position on (a) the need for language resources to embrace other dimensions of text and language use, and (b) the need for them to relate to other representations through multimodality.

1 Introduction

The late 1990s witnessed the consolidation of two important areas in Natural Language Processing (NLP). On one side, statistically-oriented methods took advantage of improved computing capacity and corpus availability to make their way into not only mainstream computational processing of linguistic structures, but also into core research in Artificial Intelligence. On the other, language resources, such as WordNet (Miller et al., 1990) and FrameNet (Baker et al., 1998), redefined methodologies, outcomes and expectations for the computationally assisted development of representations of linguistic cognition.
As both sub-fields evolved, though, only the first of them kept being framed as core NLP. As a result, on top of the enormous progress derived from the development and application of models based on embeddings and transformers – only to mention the most recent – to the analysis of human languages, the association of computational linguistics with statistics-based approaches also brought a misconception: the one according to which human languages can be modelled from form alone (Bender & Koller, 2020; Merrill et al., 2021). Form in this case equally applies to multimodal models, where strings of characters and pixels are statistically matched to emulate grounding (Kelleher & Dobnik, 2022). Such a misconception is prejudicial for two of the main purposes of NLP: that of modelling how human languages work and that of applying computational models to downstream tasks.
Regarding the first, an extensive body of research in Linguistics has demonstrated that linguistic form alone does not encode meaning in a way that is either exhaustive or sufficient for comprehension – see Fauconnier & Turner (2002) for pointers. It has also demonstrated that strict compositionality is not able to account for very central language understanding processes (Fillmore, 1979). As for the latter, as Bender et al. (2021) explain, models based on the statistic manipulation of linguistic form also encode unwanted social biases, since they are trained on static limited corpora, representing, via patterns extracted from strings of characters, limited perspectives on culture and society, excluding marginalised groups. Moreover, Rogers (2021) shows such models do not perform well on phenomena that are not frequent, because languages follow Zipf’s law and most phenomena are distributed along low frequencies of occurrence.

Dana Dannélls, Tiago Timponi Torrent, Natalia Sathler Sigiliano and Simon Dobnik. 2022. Beyond strings of characters: Resources meet NLP – Again. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 29–36. Available under CC BY 4.0

If we want computers to approach understanding of natural languages, and also get a full-range understanding of our reality, we need to equip them with a large amount of linguistic knowledge (Dobnik et al., 2022). Such knowledge includes meaning, which, in turn, is structured in terms of scenes, grounded in context and subject to construal operations (Trott et al., 2020). Such a representation of meaning cannot be achieved via statistical processes alone. Therefore, our standpoint is that structured language resources are crucial for moving NLP forward.
Research indicates that syntactic and semantic annotated resources compensate for quantitative data (Swayamdipta et al., 2018), providing evidence that rich linguistic resources complemented with high-quality annotations are valuable in the field (Conia & Navigli, 2020; Marton & Sayeed, 2021). Moreover, the tremendous computational and environmental costs resulting from training deep neural network models (Strubell et al., 2019) have also been reframing the idea that curated language resources are too expensive, while neural networks are almost free.
Nonetheless, language resources too need rethinking. Even the ones, like FrameNet (Fillmore et al., 2003), which rely heavily on annotation to attest the analyses, traditionally adopt a very narrow definition of what is an instance of language. Although Berkeley FrameNet (BFN) conducts most of its annotation process on the British National Corpus (BNC) (BNC Consortium, 2007), which is balanced for genre, the annotation disregards structure that lies beyond the syntactic locality of the target lexical unit being annotated. In this regard, because of its lexicographic origins, BFN limits annotation to the association of semantic and morpho-syntactic labels to parts of sentences, with no information being stored or even derived, for example, about the genre macro-structure and properties in frame semantics terms. As pointed out by Torrent et al. (2022), this is not a limitation of the theory of Frame Semantics (Fillmore, 1982), but a limitation deriving from how the original FN model was implemented. In this paper, we argue that the lexicographic orientation of BFN has contributed to reducing text to strings of characters.
We then present our position on (a) the need for language resources to embrace other dimensions of texts, namely those related to the characteristics of genres, which would reconcile them with statistical language models, and (b) the need for them to go one step further and embrace multimodality, since human communication is inherently multimodal. Before advancing those two claims, though, the next section presents an overview of FrameNet.

2 FrameNet in the multilingual world

FrameNet (FN), a lexical semantic resource originally developed for English, was established as a result of a computational lexicographic project led by Fillmore and his colleagues (Baker et al., 1998). The resource rests on the linguistic theory of Frame Semantics (Fillmore, 1982). It follows the standard view that humans comprehend language through a conceptual, semantic system containing the knowledge about the world that is necessary for supporting inference and performing cognitive tasks. In the context of FN, frames are general schematic representations of actions, containing Frame Elements (FEs) and Lexical Units (LUs) that can be mapped to particular instances of text type and genre. Because FN encodes both semantic and syntactic valence information about words – the two components that are arguably the driving force behind any NLP task that requires NLU – it has inspired new initiatives in other languages, including, just to name a few: Japanese (Saito et al., 2008), Spanish (Subirats, 2009), Swedish (Dannélls et al., 2021) and Portuguese (Torrent & Ellsworth, 2013). Following the design and development of BFN (Fillmore et al., 2003), these initiatives almost exclusively encode systematic representations of semantic structures and their relations to words based on empirical evidence from corpus data.
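The frame/FE/LU backbone described above can be pictured with a minimal data-structure sketch. The class layout below is our own simplification, not the actual FrameNet database schema, and the list of lexical units is illustrative:

```python
from dataclasses import dataclass, field

# A toy model of the FrameNet backbone: frames group Frame Elements (FEs)
# and are evoked by Lexical Units (LUs), here given as "lemma.pos" strings.

@dataclass
class Frame:
    name: str
    core_fes: list = field(default_factory=list)       # Frame Element names
    lexical_units: list = field(default_factory=list)  # evoking LUs

    def evoked_by(self, lu: str) -> bool:
        return lu in self.lexical_units

mental_property = Frame(
    name="Mental_property",
    core_fes=["Protagonist", "Behavior"],
    lexical_units=["creative.a", "clever.a", "foolish.a"],  # illustrative sample
)

print(mental_property.evoked_by("creative.a"))
```

A real FN entry additionally records FE coreness, frame-to-frame relations and valence patterns; the sketch only shows the minimal frame–FE–LU triangle.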
Nevertheless, the nature of the corpora from which sentences were extracted, the methods used to extract them, the annotation processes, and the skills of the annotators differ greatly between the languages. Perhaps not surprisingly, despite initiatives emulating FN in other languages for the purpose of creating full-fledged lexical semantic resources, considerably little effort has been devoted to exploiting FN resources in real-world applications. One important reason for this is the small amount of annotated data for languages other than English, as shown in Table 1.

                        English  Japanese  Portuguese  Spanish  Swedish
Total lexical units      13 421     3 405       8 393    1 268   39 212
Total annotation sets      200k       73k         12k      11k       9k

Table 1: FrameNet data statistics from Baker et al. (2015) with modification of the Portuguese and Swedish data.

The second reason is the lack of context information surrounding lexical units. For example, despite the large number of annotated sentences covered in BFN, not all of the LUs in a frame are attested with example sentences. Moreover, the representativeness of the annotation, in terms of both the lexical unit and the frame elements instantiated in each annotation set, is not guaranteed. In BFN many sentences are annotated with null instantiation categories expressing structurally omitted constituents. This lack of semantic and syntactic coverage has limited the performance of automatic processes such as semantic role labelling (see Section 3). The third reason is the abstract level of distinctiveness of the semantic categories available in frames, which results in the lack of a mechanism for representing the intrinsic meanings of LUs. For example, the LUs dog and cat belong to the Animal frame, but there are no further semantic attributes to distinguish between them or represent how they are perceived in the world. To include this knowledge, an addition of cumulative, contextual and common-sense knowledge is required (Torrent et al., 2022).
Arguably, lexical semantic resources are hard to come by. Annotation sets are amenable to the corpus at hand, as well as to annotator curation in the case of manual annotation (Chang et al., 2015), something that was acknowledged in the development and construction of the Swedish FrameNet (SweFN). SweFN is one of the largest FrameNet resources, covering nearly 40k lexical units. It was created by reusing existing Swedish linguistic resources and integrating them all into a large Swedish infrastructure for language technology through one pivot lexicon, Saldo (Borin et al., 2021). Assuming the Zipf behaviour which characterises lexical resources like Saldo is of great importance when resources are to be connected (Borin, 2010). Annotation sets in SweFN were retrieved from Korp (Borin et al., 2012) – Språkbanken Text's corpus infrastructure, which contains text types and genres from diverse spheres of human activity, ranging from academic, medical and legal texts to newspapers, fiction, journals and social media. The annotation sets provide broad coverage of the semantic and syntactic representations of LUs – an undertaking achieved by balancing computational methods and manual work, leaving some room for human intuitions and some room for consistent, robust language processing. However, in spite of being developed within a larger infrastructure centred around text, SweFN – like most other FN initiatives – is yet to incorporate information that goes beyond the micro-structure of text. In the following section we argue for this direction as a next step in the expansion of the FN model.

3 Beyond strings of characters: FrameNet meets textual genres

From the point of view of the approaches to Linguistics that take meaning and the social context of language use to be inseparable from linguistic form, text is much more than sequences of characters forming a sentence judged as grammatical in a language (Cooper, in prep).
Therefore, under those perspectives, as much as annotated corpora have played a key role in NLP for the past two decades and metadata associated with raw data has proven beneficial for many applications in the field, the material traditionally used in annotation projects – most FN initiatives included – is only one of the ingredients of a text. Standard FN-like annotation is capable of capturing morpho-syntactic information on the instantiations of FEs in a sentence. Those properties are used in BFN for building the valence patterns of LUs and may eventually inform lexicographers of the need to split a frame or re-frame a given LU, moving it to another frame (Petruck et al., 2004; Ruppenhofer et al., 2016). Also, BFN annotations of sentences have been used for training semantic role labellers such as SEMAFOR (Das et al., 2010), Open Sesame (Swayamdipta et al., 2017) and LOME (Xia et al., 2021). However, although the BNC, the main corpus used by BFN, is balanced for genres (BNC Consortium, 2007), most of the BFN annotation does not make use of this feature. This is so because BFN was built as a lexicographic resource, and annotated sentences extracted from the BNC were meant to support the analyses carried out for a given LU. Therefore, the portion of the BFN annotations used for training SEMAFOR, Open Sesame and LOME is that of the full-text annotation section, which comprises only circa 3,000 sentences from mostly bureaucratic texts, news pieces and travel guides (Das et al., 2010). In other words, those semantic role labellers are trained and tested on a tiny fraction of the textual genres there are, and the dataset used for training them does not come from a balanced corpus. Besides limiting the representativeness of the training data on the semantic side of the annotation, this is also a limiting factor for its morpho-syntactic side.
This is so because, as demonstrated by Sigiliano & Torrent (2017), different textual genres may show different morpho-syntactic valence affordances. In their paper, the authors show, for example, that the omission of core FEs, especially those with indefinite reference, was considerably higher in travel guides when compared to the occurrences of the same type of null instantiation in the TED Talk annotated for the Global FrameNet shared annotation task (Torrent et al., 2018). Nonetheless, this is not the only limitation of annotating sequences of characters forming sentences instead of annotating genres. There is a myriad of types of information that genres comprise besides the sentences in them. That is, genres can only be defined, following Swales (1990, p. 58), as a class of events sharing a communicative purpose, a schematic structure supporting such purpose, as well as similarities in form, style, structure and content, because members of a linguistic community can recognise them from shared characteristics. Also, they can only be grouped together, as proposed by Schneuwly & Dolz-Mestre (2004), because they share biases towards given linguistic operations, e.g. the common use of imperatives in instructional genres in languages such as Brazilian Portuguese and English. A pilot experiment conducted by Dutra & Sigiliano (2021) was able to extract correlations between FN-like valence patterns and genre by annotating a corpus comprising 25 exemplar texts from 25 different genres, grouped, following Schneuwly & Dolz-Mestre (2004), under the argumentative, expository, instructional, narrative and reporting domains. For this project, genres were imported into the FrameNet Brasil WebTool and annotated following the principles of full-text annotation. Although controlled for genre, the annotation does not take into consideration key features of genres, namely those located beyond the verbal language communicative mode.
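The kind of genre comparison just described can be sketched as a simple count over FN-like annotation sets. The annotation sets and genre labels below are invented for illustration; "INI" stands in for an indefinite-null-instantiation mark:

```python
from collections import Counter

# Toy annotation sets: each records its genre and, per Frame Element,
# whether the FE was overtly realised or null-instantiated ("INI").
annotations = [
    {"genre": "travel_guide", "fes": {"Agent": "INI",   "Goal": "overt"}},
    {"genre": "travel_guide", "fes": {"Agent": "INI",   "Goal": "INI"}},
    {"genre": "ted_talk",     "fes": {"Agent": "overt", "Goal": "overt"}},
]

def null_instantiation_counts(ann_sets):
    """Count FEs marked as null-instantiated per genre."""
    counts = Counter()
    for ann in ann_sets:
        counts[ann["genre"]] += sum(1 for v in ann["fes"].values() if v == "INI")
    return counts

print(null_instantiation_counts(annotations))
```

Aggregating such counts over a genre-controlled corpus is, in essence, what allows valence affordances to be correlated with genre.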
As Bateman (2008) points out, however, advances in technology have been highlighting the importance of other communicative modes for genre analysis. The author also notes the lack of analytical tools accounting for those multimodal aspects, especially tools that ground the analysis in more concrete details of the objects being analysed. Hiippala (2014) points to the fact that the lack of linearity in multimodal genres compromises current analytical tools generally applied to text-only genre analysis. The author proceeds by claiming that, because multimodal genres are stratified, tools used for analysing them must be capable of identifying semiotic choices contributing to the genre structure in multiple strata. In the following section we claim that a multimodal turn in FN might provide such a tool.

4 Beyond verbal language: FrameNet meets multimodality

To some extent, FN analyses, especially full-text annotation, are already capable of providing non-linear representations of the semantics of text. As pointed out by Torrent et al. (2022), the very nature of frames, as defined by Fillmore (1982), includes both common-sense knowledge and communicative situation grounding. However, methodological decisions made when implementing Frame Semantics as a computational lexicographic resource focusing mostly on content – as opposed to functional – words have precluded such aspects from being properly considered in annotation. As for communicative situation grounding, work on pragmatic frames (Ohara, 2018; Czulo et al., 2020) has begun to indicate paths for expanding the kinds of frames FN may include, by looking into the pragmatic set-ups evoked by grammatical constructions. The idea is that frames should be extended to represent linguistic knowledge activated by language structures such as deixis, turn taking and information status.
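Before turning to concrete multimodal annotation efforts, one can sketch what pairing frame annotations across modalities might look like as data. Everything below (field names, frames, FE fillers, coordinates) is invented for illustration and is not the export format of any existing tool:

```python
# Invented example: a frame annotation over a sentence and a frame
# annotation over a video region, correlated via a shared timestamp.

text_annotation = {
    "time": 12.4,  # seconds into the video
    "sentence": "They hiked along the ridge.",
    "frame": "Self_motion",
    "fes": {"Self_mover": "They", "Path": "along the ridge"},
}

visual_annotation = {
    "time": 12.4,
    "frame": "Natural_features",
    "fes": {"Locale": {"box": (80, 30, 610, 400)}},  # x1, y1, x2, y2
}

def combined_frames(text_ann, visual_ann):
    """Merge the frames evoked in each modality at the same point in time."""
    assert text_ann["time"] == visual_ann["time"]
    return sorted({text_ann["frame"], visual_ann["frame"]})

print(combined_frames(text_annotation, visual_annotation))
```

The point of such a pairing is that neither modality alone yields the combined scene representation; the correlation does.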
Nonetheless, once one recognises the inherently multimodal nature of human communication, other communicative modes must be considered as well. Belcavello et al. (2020) introduce the idea of a multimodal FN, reporting on a pilot annotation experiment conducted on a TV travel documentary. The authors describe the process of extracting verbal language data from the video and feeding the output into full-text annotation. They discuss the methodology for annotating video sequences for frames using bounding boxes associated with elements in the scenes.

Figure 1: An image annotated with the Charon video annotation interface.

In another annotation experiment, Viridiano et al. (2022) report on the annotation of the Flickr30k Entities dataset (Plummer et al., 2015) for frames and FEs. Both annotation efforts are conducted in Charon, an annotation tool developed to extend FN-like annotation to the visual mode (Belcavello et al., 2022), shown in Figure 1. This allows the annotations of the verbal language and visual modes to be correlated. In the part of the TV show being annotated, a Reykjavik local interviewed by the program host explains that the financial crisis in Iceland had a positive impact on the people. He then utters (1). Such a sentence can be annotated for the LU criativo.a evoking the Mental_property frame, as shown in (2).

(1) O povo voltou a ser criativo.
    'People became creative again.'

(2) [O povo]PROTAGONIST voltou a [ser]Copula [criativo]BEHAVIOR (frame: Mental_property).

The visual mode, in turn, is annotated for the Physical_artworks frame, with the graffiti on the wall annotated for the ARTEFACT FE. The annotations of both modalities, when combined, thus have the potential of enriching the semantic representation FN is capable of providing for a multimodal genre.

5 What lies ahead in the future?
Making predictions about what lies ahead in NLP is always risky, but the position that we have laid out in this paper indicates that we should consider at least four points:

(i) Focus on the creation of high-quality resources that not only overcome the current resource biases but also cover a wide variety of genres expressing different communicative intents. This means that data collection is as important as the annotation of these resources with frames. Without good data, we cannot have good coverage in our resources.

(ii) Frame annotation must invariably go beyond the annotation of texts. This will dissociate frame annotation from its current bias towards textual forms and allow frames to represent more common-sense knowledge, making them useful for a variety of natural language inference tasks. An excellent example of this is the integration of other resources carried out in the SweFN project. More knowledge is always better, but then a question arises as to what degree such representations can be applied in traditional tasks such as semantic role labelling in texts alone, since not all such information is explicitly expressed in texts, and contexts will have to be disambiguated. This brings us to the third point.

(iii) Meaning representations should be multimodal, as this is how communication works. The interesting question is then which modes are to be integrated, and how. We have seen examples of annotation of images and videos in FN-Brasil, but one could also include gestures, emotions and sentiment. Such an undertaking assumes a theoretical understanding of how different modalities interact and make up meaning, and also of how language is used in linguistic and non-linguistic interaction with the world. However, despite a large amount of theoretical and experimental work in these areas, such mechanisms are not yet fully known and understood.

(iv) How do we represent multimodal meaning in a way that it can be linked to frame representations?
Bridging conceptual and perceptual domains inevitably involves classification, and hence we need representations that are hybrid: data-driven and machine-learned on the one hand, and expert-defined, encoding frame information, on the other. Moreover, frame representations need to become multidisciplinary to capture what other fields – such as Computer Vision, Robotics, Music and Film Theories, and Mass Media Communication – already know about meaning production. What will be made of these points might appear in future research, or in Lars' 75th anniversary Festschrift.

Acknowledgements

Research conducted in FrameNet Brasil is funded by FAPEMIG RED-00106-21 and CNPq 408269/2021-9 research grants. Tiago Torrent is an awardee of CNPq's Research Productivity grant 315749/2021-0. Natália Sigiliano and Tiago Torrent are thankful for the support of the Department of Swedish, Multilingualism and Language Technology at Gothenburg University. This work was supported by Swe-Clarin (grant 201302003). It was also supported by the Centre for Linguistic Theory and Studies in Probability (CLASP) (Swedish Research Council (VR) grant 2014-39).

References

Collin F. Baker, Charles J. Fillmore, & John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, ACL '98/COLING '98, pages 86–90, Montreal, Quebec. Association for Computational Linguistics.

Collin F. Baker, Nathan Schneider, Miriam R. L. Petruck, & Michael Ellsworth. 2015. Getting the roles right: Using FrameNet in NLP. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 10–12, Denver, Colorado. Association for Computational Linguistics.

John Bateman. 2008. Multimodality and genre: A foundation for the systematic analysis of multimodal documents. Palgrave MacMillan, New York.
Frederico Belcavello, Marcelo Viridiano, Alexandre Diniz da Costa, Ely Edison da Silva Matos, & Tiago Timponi Torrent. 2020. Frame-based annotation of multimodal corpora: Tracking (a)synchronies in meaning construction. In Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet, pages 23–30, Marseille. European Language Resources Association.

Frederico Belcavello, Marcelo Viridiano, Ely Matos, & Tiago Timponi Torrent. 2022. Charon: A FrameNet annotation tool for multimodal corpora. In Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022, pages 91–96, Marseille. European Language Resources Association.

Emily M. Bender & Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, & Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

BNC Consortium. 2007. British National Corpus, XML edition. Oxford Text Archive.

Lars Borin, Markus Forsberg, & Johan Roxendal. 2012. Korp – the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 474–478, Istanbul. ELRA.

Lars Borin, Dana Dannélls, & Karin Friberg Heppin. 2021. Introduction: Swedish FrameNet++. In Dana Dannélls, Lars Borin, & Karin Friberg Heppin, editors, The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications, pages 3–36. John Benjamins Publishing Company, Amsterdam / Philadelphia.

Lars Borin. 2010.
Med Zipf mot framtiden – en integrerad lexikonresurs för svensk språkteknologi [With Zipf into the future – an integrated lexical resource for Swedish language technology]. LexicoNordica, 17:35–54.

Nancy Chang, Praveen Paritosh, David Huynh, & Collin F. Baker. 2015. Scaling semantic frame annotation. In Proceedings of the 9th Linguistic Annotation Workshop, pages 1–10, Denver, Colorado. Association for Computational Linguistics.

Simone Conia & Roberto Navigli. 2020. Bridging the gap in multilingual semantic role labeling: a language-agnostic approach. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1396–1410, Barcelona (Online). International Committee on Computational Linguistics.

Robin Cooper. in prep. From perception to communication: An analysis of meaning and action using a theory of types with records (TTR). To appear in Oxford Studies in Semantics and Pragmatics, Oxford University Press.

Oliver Czulo, Alexander Ziem, & Tiago Timponi Torrent. 2020. Beyond lexical semantics: notes on pragmatic frames. In Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet, pages 1–7, Marseille. European Language Resources Association.

Dana Dannélls, Lars Borin, Markus Forsberg, Karin Friberg Heppin, & Maria Toporowska Gronostaj. 2021. Swedish FrameNet. In Dana Dannélls, Lars Borin, & Karin Friberg Heppin, editors, The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications, pages 37–65. John Benjamins, Amsterdam.

Dipanjan Das, Nathan Schneider, Desai Chen, & Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 948–956, Los Angeles, California. Association for Computational Linguistics.
Simon Dobnik, Robin Cooper, Adam Ek, Bill Noble, Staffan Larsson, Nikolai Ilinykh, Vladislav Maraev, & Vidya Somashekarappa. 2022. In search of meaning and its representations for computational linguistics. In Proceedings of the 2022 CLASP Conference on (Dis)embodiment, pages 30–44, Gothenburg. Association for Computational Linguistics.

Lívia Dutra & Natália Sigiliano. 2021. Ferramenta linguístico-computacional como facilitadora para o ensino de gramática na escola [A linguistic-computational tool as a facilitator for teaching grammar in school]. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 432–436, Porto Alegre, RS. SBC.

Gilles Fauconnier & Mark Turner. 2002. The way we think: Conceptual blending and the mind's hidden complexities. Basic Books, New York.

Charles J. Fillmore, Miriam R. L. Petruck, Josef Ruppenhofer, & Abby Wright. 2003. FrameNet in action: The case of attaching. International Journal of Lexicography, 16(3):297–332.

Charles J. Fillmore. 1979. Innocence: a second idealization for linguistics. In Annual Meeting of the Berkeley Linguistics Society, volume 5, pages 63–76.

Charles J. Fillmore. 1982. Frame semantics. In Linguistic Society of Korea, editor, Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Co., Seoul.

Tuomo Hiippala. 2014. Multimodal genre analysis, chapter 11, pages 111–124. De Gruyter Mouton, Berlin.

John D. Kelleher & Simon Dobnik. 2022. Distributional semantics for situated spatial language? Functional, geometric and perceptual perspectives. In Jean-Philippe Bernardy, Rasmus Blanck, Stergios Chatzikyriakidis, Shalom Lappin, & Aleksandre Maskharashvili, editors, Probabilistic approaches to linguistic theory, pages 319–356. CSLI Publications, Center for the Study of Language and Information, Stanford University, Stanford, California.

Yuval Marton & Asad Sayeed. 2021. Thematic fit bits: Annotation quality and quantity interplay for event participant representation. ArXiv.

William Merrill, Yoav Goldberg, Roy Schwartz, & Noah A. Smith. 2021.
Provable limitations of acquiring meaning from ungrounded form: What will future language models understand? Transactions of the Association for Computational Linguistics, 9:1047–1060.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, & Katherine J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244.

Kyoko Ohara. 2018. The relations between frames and constructions: A proposal from the Japanese FrameNet Constructicon. In Benjamin Lyngfelt, Lars Borin, Kyoko Ohara, & Tiago Timponi Torrent, editors, Constructicography: Constructicon development across languages, pages 141–164. John Benjamins, Amsterdam.

Miriam R. L. Petruck, Charles J. Fillmore, Collin F. Baker, Michael Ellsworth, & Josef Ruppenhofer. 2004. Reframing FrameNet data. In Proceedings of the 11th EURALEX International Congress, pages 405–416. Université de Bretagne Sud, Lorient.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, & Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2641–2649.

Anna Rogers. 2021. Changing the world by changing the data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2182–2194, Online. Association for Computational Linguistics.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, & Jan Scheffczyk. 2016. FrameNet II: Extended theory and practice. Technical report, International Computer Science Institute.

Hiroaki Saito, Shunta Kuboya, Takaaki Sone, Hayato Tagami, & Kyoko Ohara. 2008. The Japanese FrameNet software tools. In Proceedings of the International Conference on Language Resources and Evaluation, LREC, Marrakech. ELRA.
Bernard Schneuwly & Joaquim Dolz-Mestre. 2004. Gêneros orais e escritos na escola [Oral and written genres in school]. Mercado de Letras, São Paulo.

Natália Sigiliano & Tiago Torrent. 2017. FrameNet annotation as a means to identify genre-relevant linguistic structures. In ScriptUM: la revue du colloque VocUM, pages 1–20.

Emma Strubell, Ananya Ganesh, & Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence. Association for Computational Linguistics.

Carlos Subirats. 2009. Spanish FrameNet: A frame-semantic analysis of the Spanish lexicon. In Hans C. Boas, editor, Multilingual FrameNets in Computational Lexicography: Methods and Applications, pages 135–162. Mouton de Gruyter, Berlin.

John M. Swales. 1990. Genre analysis: English in academic and research settings. Cambridge University Press.

Swabha Swayamdipta, Sam Thomson, Chris Dyer, & Noah A. Smith. 2017. Frame-semantic parsing with softmax-margin segmental RNNs and a syntactic scaffold. arXiv preprint arXiv:1706.09528.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, & Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782, Brussels. Association for Computational Linguistics.

Tiago Timponi Torrent & Michael Ellsworth. 2013. Behind the labels: criteria for defining analytical categories in FrameNet Brasil. Veredas – Revista de Estudos Linguísticos, 17(1):44–66.

Tiago Timponi Torrent, Michael Ellsworth, Collin Baker, & Ely Edison da Silva Matos. 2018. The multilingual FrameNet shared annotation task: a preliminary report. In Tiago Timponi Torrent, Lars Borin, & Collin F. Baker, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris. European Language Resources Association (ELRA).
Tiago Timponi Torrent, Ely Edison da Silva Matos, Frederico Belcavello, Marcelo Viridiano, Maucha Andrade Gamonal, Alexandre Diniz da Costa, & Mateus Coutinho Marim. 2022. Representing context in FrameNet: A multidimensional, multimodal approach. Frontiers in Psychology, 13.

Sean Trott, Tiago Timponi Torrent, Nancy Chang, & Nathan Schneider. 2020. (Re)construing meaning in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5170–5184, Online. Association for Computational Linguistics.

Marcelo Viridiano, Tiago Timponi Torrent, Oliver Czulo, Arthur Lorenzi, Ely Matos, & Frederico Belcavello. 2022. The case for perspective in multimodal datasets. In Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022, pages 108–116, Marseille. European Language Resources Association.

Patrick Xia, Guanghui Qin, Siddharth Vashishtha, Yunmo Chen, Tongfei Chen, Chandler May, Craig Harman, Kyle Rawlins, Aaron Steven White, & Benjamin Van Durme. 2021. LOME: Large ontology multilingual extraction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 149–159, Online. Association for Computational Linguistics.

Ordvektorer i lexikografiskt arbete [Word vectors in lexicographic work]

Markus Forsberg, Språkbanken Text, Inst. för svenska, flerspråkighet och språkteknologi, Göteborgs universitet, Sverige (markus.forsberg@gu.se)
Emma Sköldberg, Språkbanken Text, Inst. för svenska, flerspråkighet och språkteknologi, Göteborgs universitet, Sverige (emma.skoldberg@gu.se)

Abstract

We present a preliminary case study on the use of word vectors in lexicographic practice. The study shows the potential of using vector models in the revision of existing dictionary entries as well as creating new entries.

1 Introduction

Lexicographic practice today is data-driven: linguistic descriptions are primarily based on the evidence gathered from large amounts of text, so-called corpora.
The language examples are usually presented in the form of concordances, which can rightly be regarded as the lexicographer's foremost tool. Atkins & Rundell (2008) capture the reason in the following quote: "One of the earliest revelations of corpus study was that right- or left-sorted concordances will often give a powerful, visual representation of a word's recurrent patterns - in a way that is impossible to ignore or overlook." The use of various kinds of word pictures (Borin et al., 2012) is also standard practice; these abstract over the linguistic contexts of words in different ways in order to give a better overview of what occurs in a word's context. Word pictures can, for example, be based on an automatic syntactic analysis, where the syntactic constituents are gathered in separate tables, such as a table of all subject heads for a given verb. Word pictures, too, are important tools for the lexicographer, helping to produce as thorough and fair a picture of the headwords as possible, not least of their syntagmatic properties. The question is whether there are other methods that can contribute new aspects and perspectives.

In the present work we report on an initial study of the value of using word vectors as a tool in lexicographic work. More concretely, we study what the use of word vectors can contribute, first, to the description of established Swedish words already included in a particular dictionary, here the Swedish Academy's Svensk ordbok (2021; henceforth SO), and, second, to the description of new words that may be included in the next update of the same work.1

Before we briefly explain what word vectors are, a short description of the dictionary in question is in order. SO is a general-language definition dictionary whose task is to reflect contemporary Swedish. The dictionary contains about 65,000 headwords, provided with information about, among other things, pronunciation, inflection, meaning(s), construction(s), phraseology and etymology.
The emphasis, however, is on what the headwords mean and how they are used. The meaning descriptions are supported by language examples, both morphological and syntactic. The headwords in SO are also related to other headwords in the same work through cross-references to, among others, synonymous, antonymous and co-hyponymous words. The dictionary can thus be said to constitute a kind of semantic network between different lexical units in Swedish (Blensenius et al., 2021). SO is available both as apps and via the dictionary portal svenska.se.

1 SO is developed within the framework of a collaboration between the Swedish Academy and the University of Gothenburg, by a research team that we ourselves are part of.

Markus Forsberg and Emma Sköldberg. 2022. Ordvektorer i lexikografiskt arbete. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 37–41. Available under CC BY 4.0

2 Word vectors

Word vectors, also called word embeddings, are mathematical representations of words that attempt to capture the properties of words in such a way that related words end up close to each other in a so-called vector space. This is done with the help of the contexts in which the words occur, gathered from a large amount of text material. Put differently, words are considered related when they occur in the same type of linguistic context. The quality of word vectors depends on how many relevant linguistic contexts occurred in the text material used to create each vector. Given that word statistics are Zipfian distributed, i.e. rapidly decreasing, there will inevitably be words that end up close to each other in the vector space because of chance rather than contextual similarity. This must be taken into account in a study such as this one.

There are many different methods for creating word vectors, all with their own advantages and disadvantages. The method we have chosen for this study is implemented in the tool fastText (Bojanowski et al., 2016), which represents each word on the basis of its parts.
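The "closeness" in the vector space mentioned above is typically measured as cosine similarity between vectors. A minimal sketch with invented three-dimensional toy vectors (real fastText vectors have 300 dimensions):

```python
import math

# Toy "word vectors" with invented values, only to illustrate the measure:
# related words should score higher than unrelated ones.
vectors = {
    "kärlek":  [0.9, 0.1, 0.3],   # 'love'
    "vänskap": [0.8, 0.2, 0.4],   # 'friendship'
    "traktor": [0.1, 0.9, 0.0],   # 'tractor'
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine(vectors["kärlek"], vectors["vänskap"]), 3))
print(round(cosine(vectors["kärlek"], vectors["traktor"]), 3))
```

Ranking all words in the vocabulary by this score against a query word yields exactly the kind of nearest-neighbour lists used in the study.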
The advantage of this is that previously unseen words can end up in the right place in the vector space, provided that their parts have been observed. Technically, the method is said to handle OOV (out-of-vocabulary) words. This property is important when studying a language like Swedish, with its rich production of new compounds. The fastText tool is distributed together with pretrained word vectors for 157 languages (Grave et al., 2018), and we have used the Swedish ones. The word vectors are based on the Common Crawl dataset, a large collection of web pages, and on Wikipedia. There is a rich flora of scholarly work on how the fruits of lexicographic labour can be used to improve word vector models, but fewer studies address how word vectors can be used in lexicographic practice. A study close to the one we report on here is Sørensen & Nimb (2018), who use another popular word vector method, word2vec, as a lexicographic tool for finding words missing from a given semantic field or for identifying inconsistencies in existing descriptions.

3 The study

Within the scope of this investigation we have studied a total of 20 Swedish words. Half of them are treated in SO (2021); the other half can be regarded as new-word candidates relative to SO. Characteristic of the latter type is that they appear in a list of new-word candidates compiled by comparing Swedish text material from 2021 with corresponding material from 2020. The words in question have either appeared as novelties in the later material or increased in use. The ten established words are: adekvat, disputation, fräsch, hund, kärlek, organisera, röd, sjunga, usch, and åldras. As can be seen, they belong to different parts of speech. Moreover, they typically occur in different contexts, not least in different kinds of texts.
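The nearest-neighbour lists examined below amount to ranking all words by cosine similarity to a query vector. A minimal sketch with made-up toy vectors (the study itself queried the pretrained Swedish fastText model):

```python
import numpy as np

# Toy embedding table with hypothetical 3-dimensional vectors.
emb = {
    "kärlek":  np.array([0.9, 0.1, 0.0]),
    "vänskap": np.array([0.8, 0.3, 0.1]),
    "passion": np.array([0.7, 0.2, 0.0]),
    "traktor": np.array([0.0, 0.1, 0.9]),
}

def neighbours(word, k=3):
    """Return the k nearest neighbours of `word`, by descending cosine similarity."""
    q = emb[word]
    sims = {}
    for other, v in emb.items():
        if other == word:
            continue
        sims[other] = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

# Produces a descending-similarity list of the kind shown for kärlek below.
for w, s in neighbours("kärlek"):
    print(f"{w} ({s:.3f})")
```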
The ten new-word candidates we have examined are: blåbrun, gangsterrap, glamping, kontaktförbud, matresa, mockumentär, prosecco, smittovåg, snabbtesta, and yes. Most of these are nouns, which also says something about the part-of-speech distribution in the full list of new-word candidates. In examining the 20 words, we have studied their 100 nearest neighbours in the vector space. Here, for example, is the list of the 100 nearest neighbours of the noun kärlek ('love'), sorted in descending order of similarity (the numbers in parentheses): kärlek. (0.783), kärlek.En (0.779), kärleken (0.778), Kärlek (0.764), självkärlek (0.751), familjekärlek (0.742), föräldrakärlek (0.739), kärlekDu (0.732), kärlek- (0.729), hat-kärlek (0.728), kärlek.Och (0.722), vänskap (0.718), moderskärlek (0.714), Gudskärlek (0.713), livskärlek (0.711), kärlekAtt (0.710), kärleksfullhet (0.706), människokärlek (0.703), kärlek.Men (0.701), syskonkärlek (0.685), kärlekslycka (0.681), tonårskärlek (0.680), kärlek.Jag (0.675), förälskelse (0.675), kärlekssorg (0.675), kärlekskänslor (0.674), kärlek.Det (0.673), kärlekshandling (0.672), kärleksrus (0.670), kärleksmagi (0.667), Gudakärlek (0.665), längtan (0.662), egenkärlek (0.660), kärlekskraft (0.658), kärlekDag (0.655), passion (0.653), moderskärleken (0.653), kärleksförklaring (0.649), kärlekslöshet (0.649), kärlekslängtan (0.648), Självkärlek (0.648), romantik (0.647), broderskärlek (0.647), nyförälskelse (0.645), människokärleken (0.643), kärleksbevis (0.642), hjärtesorg (0.642), tvåsamhet (0.639), vänskap.
(0.639), kärleksgärning (0.635), kärleksberättelse (0.635), förälskelser (0.635), villkorslös (0.633), kärlekar (0.631), Kärleken (0.630), kärlekstrubbel (0.629), matkärlek (0.628), omtänksamhet (0.627), moderlighet (0.627), kärlekEnsamhet (0.626), wikikärlek (0.625), kärlekstecken (0.625), kärlekshandlingar (0.625), sanningskärlek (0.625), Förälskelse (0.624), kärleksförhållanden (0.622), kärlekslyrik (0.621), barnlängtan (0.621), Nätkärlek (0.619), ömhet (0.619), kärleksakt (0.618), kärleksljus (0.618), Hatkärlek (0.617), kärleksbok (0.616), kärleksdag (0.616), egenkärleken (0.615), Wikikärlek (0.614), Människokärlek (0.613), kärlekarna (0.613), villkorslösa (0.613), glädje (0.612), känslosamhet (0.611), känslor (0.610), ömsinthet (0.609), kärlekDe (0.609), kärleksfyllt (0.608), kärlekshyllning (0.608), omtanke (0.607), kärleksband (0.607), kärleks (0.607), kärleken. (0.607), kärleksarbete (0.606), Moderskärlek (0.605), kärleksfyllda (0.605), kärle (0.604), kärleksförklaringar (0.604), kärlekLäs (0.604), kärleksförhållandet (0.604), sorg (0.603), ledsamhet (0.602)

As can be seen, the neighbours include a fair amount of noise, such as kärlek.En and kärlekDe, which stems from the tokenization that Grave et al. (2018) used when they created the Swedish vector model.

4 Observations in the vector space

In short, the neighbour lists look very different for the 20 selected words. Given the limited space available here, we summarize below some of the kinds of observations that can be made about the words studied.

4.1 Established words

Here are a few summary remarks on the neighbours of some of the selected established words:

kärlek: Among the neighbours are words denoting various kinds of love, e.g. moderskärlek ('motherly love'), syskonkärlek, tonårskärlek, and hatkärlek. There are also words denoting other related emotions and the like, such as vänskap ('friendship'), förälskelse, längtan, passion, romantik, nyförälskelse, omtanke, ömhet, hjärtesorg, tvåsamhet.
There is also a typical attribute: villkorslös ('unconditional').

adekvat: The 100 neighbours consist almost exclusively of other adjectives. Most are synonyms or near-synonyms of adekvat, e.g. tillfredsställande, erforderlig, fullgod, ändamålsenlig. The list also contains words that can be regarded as more or less antonymous, e.g. inadekvat, otillräcklig, otillfredsställande. None of these words is included among the cross-references in the entry adekvat in SO (2021).

disputation: Most of the neighbours denote phenomena belonging to the everyday life of the doctoral student, e.g. doktorsavhandling, slutseminarium, spikning, disputationsfest. All neighbours except a handful are nouns; among the exceptions is the central verb disputera.

röd: A clear majority of the 100 neighbours are more or less common colour terms such as gul, buteljgrön, and rödviolett. Almost all of the remaining words end in the element -färgad ('-coloured'), e.g. laxfärgad.

sjunga: The list contains both verbs, e.g. nynna, joddla, and gnola, and nouns. Several of the latter denote various kinds of songs, including luciasången, julvisor, psalmer. The words in question thus stand in different relations to the verb under investigation.

4.2 New words

Here are some summary remarks on the neighbours of some of the selected new words:

gangsterrap: Among the neighbours are compounds with gangster as first or last element, e.g. gangsterkung, smågangsters, but also many words containing maffia, e.g. maffiaorganisation, knarkmaffian. There are also other words associated with criminality, e.g. droghandlare, as well as words related to hip hop. Finally, the spelling variant gangstarap appears in the list.

glamping: The word stands for 'glamorous camping'. The list of neighbours contains a range of compounds with camping whose first elements illustrate the diversity of present-day camping life, e.g. husvagns-, fri-, fiske-, vildmarks-, vintersports-, and naturist-.
It is worth noting that the list of neighbours shows no trace of the component 'glamour'.

mockumentär: In fifth place among the neighbours is a compound, fejkdokumentär, which provides a good paraphrase and an important signal to the lexicographer about what the word means. The list of neighbours otherwise contains a large number of different kinds of films. Some have documentary content, e.g. minidokumentär, dramadokumentär, naturdokumentär; others have more fictional content, e.g. zombiefilm, äventyrsfilm.

snabbtesta: Among the neighbours are both the noun from which the verb is derived (snabbtest) and other compounds ending in -testa, e.g. hårdtesta, trycktesta, funktionstesta, stresstesta, betatesta.

yes: The neighbours include many English-sounding words, e.g. yeah, shit, kidding, well, say, indeed. Some of the neighbours, such as the more Swedish-sounding jaaaaa, aaaah, tadaaaa, tjoho, and niiiice, may in some contexts be more or less synonymous with the word in question. The spelling of these and other words in the list is also worth noting.

5 Discussion

The study shows that the relations between the 20 words examined and their respective neighbours vary. One reason may be that the words studied are of different kinds, which affects their linguistic contexts: they belong to different parts of speech, they differ in frequency and distribution in present-day Swedish texts, they belong to different stylistic levels, etc. One example is adekvat, whose neighbours are almost exclusively other adjectives. Another example is hund, whose neighbours appear to consist solely of nouns. This is an advantage when the lexicographer is looking for new cross-references for SO, which appear under headings such as Synonym, Antonym, and Jämför ('Compare'). Previous studies have shown that SO can be strengthened with respect to such links between entries (Blensenius et al., 2021). A word like sjunga can thus be supplied with illustrative cross-references to lemmas such as nynna and gnola. A new word like prosecco can be linked to existing SO words such as champagne and cava.
Furthermore, the lexicographer finds plenty of words that can serve as additional morphological examples, both for words already described and for new ones. In a case like smittovåg, the variation in the linking element in compounds with smitta becomes very clear. Word vectors can also contribute information about headwords other than those examined here. In a case like snabbtesta, it is clear that the verb entry testa in SO lacks typical compounds such as hårdtesta, stresstesta, and betatesta. The investigation of glamping shows how the entry camping in SO would need to be modernized. More generally, word vectors can delineate semantic fields of various kinds. One example is kärlek, with neighbours denoting different types of emotions, states, etc. These are important when clarifying the meaning of the word kärlek. A word sketch for kärlek, however, also shows other related words of different parts of speech, such as the adjective obesvarad and the verb hysa. Such words are of course also very important for the lexicographer, not least when describing collocations. The neighbours can also highlight societal changes reflected in the Swedish vocabulary. One example is the neighbours of matresa, which point to the diversity of themed trips on offer. Perhaps the neighbours can also say something about a word's evaluative loading and connotations. One example is åldras, where the number of negatively loaded words among the neighbours is striking. At the same time, the neighbours may show some bias towards one of the meaning components of the word under investigation; see, e.g., glamping: as noted, its list of neighbours contains a number of co-hyponyms denoting camping variants, but no trace of the component 'glamour'. For a loanword like yes, we also see a high proportion of English words among the neighbours, which shows that the material used to build the vector model contains a fair amount of English text. Finally, here and there one finds promising new-word candidates, e.g.
efterforskningsförbud, a neighbour of kontaktförbud.

6 Conclusion

Word vectors give us the words whose linguistic contexts resemble one another, complementing what we get from concordances and word sketches, which provide an overview of the contexts of specific words. Word vectors can draw the lexicographer's attention to, among other things, words that can serve as morphological usage examples and as cross-references to related entries. They can thus help to further strengthen the semantic network that a dictionary such as SO already constitutes. Moreover, they can help the lexicographer, in a comparatively objective and data-driven way, to see links between existing headwords in the dictionary and words to be added in connection with a revision.

Afterword

We wish to thank² Lars Borin for being, and always having been, such a positive force in our professional lives and in the work we share.

References

B.T. Sue Atkins & Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford.

Kristian Blensenius, Emma Sköldberg, & Erik Bäckerud. 2021. Finding gaps in semantic descriptions: Visualisation of the cross-reference network in a Swedish monolingual dictionary. In Proceedings of the eLex 2021 conference.

Piotr Bojanowski, Edouard Grave, Armand Joulin, & Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Lars Borin, Markus Forsberg, & Johan Roxendal. 2012. Korp: the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 474–478. Istanbul: ELRA.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, & Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Nicolai Hartvig Sørensen & Sanni Nimb. 2018. Word2Dict: Lemma selection and dictionary editing assisted by word embeddings.
In Iztok Kosem, Jaka Čibej, Vojko Gorjanc & Simon Krek, editors, Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts, pages 819–826, Ljubljana, Slovenia, July. Ljubljana University Press, Faculty of Arts.

²This work has been supported by Nationella språkbanken and HUMINFRA, both funded by the Swedish Research Council (2018–2024, contract 2017-00626; 2022–2024, contract 2021-00176) and their participating partner institutions.

The diachrony of political terror: Tracing terror and terrorism in Swedish parliamentary data 1867–1970

Mats Fridlund, Daniel Brodén, Victor Wåhlstrand Skärström
Centre for Digital Humanities
University of Gothenburg, Sweden
name.surname1.surname2@gu.se

Abstract

The paper explores the development of the closely related words 'terror' and 'terrorism' as manifested in the discourse of the Swedish Parliament, 1867–1970, drawing on digital history and language technology methodologies and tools. Combining distant and close reading, we show that terror-related words first gained traction from 1918 onwards. The recorded uses of words and compounds indicate that terror-related phenomena were often associated with states rather than individuals, but also that terror-related words have been used metaphorically in relation to non-violent domestic issues. Our results confirm the argument that the word terrorism primarily gained its modern meaning in the early 1970s. We conclude by stressing the potential of combining LT-driven and interpretative approaches for investigating the diachronicity of words in parliamentary corpora.

1 Introduction

This study provides a step towards a digital history of the Swedish political discourse on political terror by means of distant and close reading of parliamentary texts.
Drawing on a mixed-methods approach, we explore the development of the closely related words 'terror' and 'terrorism' as manifested in the discourse of the bicameral Parliament throughout its existence, 1867–1970. From an etymological perspective, terror (same spelling in Swedish and English) designates an intense state of fear or horror and has been used in Swedish written accounts since at least the 1600s. Terrorisera ('to terrorize') means to put people in such a state through one's actions. Terror was rarely used before 1918, when it gained a second meaning, signifying the use of certain means 'i politiskt syfte för att sprida skräck o. därvid uppnå vissa mål' ('for political purposes, to spread fear and thereby achieve certain goals') by both state and sub-state actors (saob.se). Likely, this new use derived from English or Russian and was connected to the 1917 Russian Revolution and the Finnish Civil War of 1918. Terrorism had entered Swedish in the early 19th century, primarily referencing the French Revolution's Reign of Terror (initially translated as skräckväldet), and was similarly associated with state repression. In the 1970s, terrorism gained its modern meaning, becoming distinctly associated with sub-state violence against civilians or non-combatants (Stampnitzky, 2013). The aim of this study is to explore how the words terror and terrorism have been used in Swedish parliamentary discourse, focusing on the different meanings that have been ascribed to them over time. To map these meanings we draw on the digitized parliamentary records and the resources of Språkbanken Text. Partly, we follow prior initiatives by the Swedish CLARIN node, Swe-Clarin, where humanities and social sciences scholars collaborate with researchers in natural language processing in using LT-based e-science tools for HSS research (Karsvall & Borin, 2018; Viklund & Borin, 2016).
The present study builds on prior explorations together with Lars Borin of the newspaper discourse on terrorism in Sweden and Finland (Fridlund et al., 2019; Fridlund et al., 2020; Fridlund et al., 2022). Specifically, we ask two research questions regarding the understanding of the phenomenon of terrorism in the Swedish parliamentary discourse during the period in focus: (1) What variations of meanings have the words terror and terrorism had when used in isolation and as compound words, and (2) what terror-related compounds have been added to the discourse over time?

Mats Fridlund, Daniel Brodén and Victor Wåhlstrand Skärström. 2022. The diachrony of political terror: Tracing terror and terrorism in Swedish parliamentary data 1867-1970. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 43–47. Available under CC BY 4.0

Figure 1: The diachrony of political terror in the Swedish Parliament 1867–1970. First occurrences of new derivations of terror in the bicameral corpus, with bars showing the use of all derivations. The top left table shows the terror lexemes with more than 10 occurrences.

2 Analyzing the bicameral Riksdag parliamentary data

The parliament's texts available at the National Library of Sweden (riksdagstryck.kb.se) (retrieved 2021-11-11) consist of 10 categories, such as motions, propositions and minutes of the debates. The material was processed for analysis with tokenization, lemmatization and dependency parsing by means of the Sparv pipeline, a tool designed for automatic neural and statistical annotation of documents with textual structure and linguistic properties for Swedish applications (Borin et al., 2016; Ljunglöf et al., 2019; Hengchen & Tahmasebi, 2021).
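As a toy illustration of this kind of processing, per-year absolute and relative counts of a lemmatized head word can be computed as follows. The mini-corpus here is hypothetical stand-in data, not the Sparv-processed parliamentary records or the Context tool:

```python
from collections import Counter

# Toy corpus: (year, lemmatized tokens); invented stand-in documents.
docs = [
    (1903, ["fartyget", "terror", "anlöpte", "hamnen"]),
    (1918, ["terror", "och", "terrorism", "i", "finland"]),
    (1918, ["röd", "terror"]),
    (1925, ["fackföreningsterror", "på", "arbetsmarknaden"]),
]

def diachrony(corpus, head="terror"):
    """Absolute and relative yearly frequency of tokens containing `head`."""
    hits, totals = Counter(), Counter()
    for year, tokens in corpus:
        totals[year] += len(tokens)
        hits[year] += sum(head in t for t in tokens)
    return {y: (hits[y], hits[y] / totals[y]) for y in sorted(totals)}

print(diachrony(docs))
```

Substring matching on lemmas (`head in t`) also picks up compounds such as fackföreningsterror, which is why the paper's counts were additionally curated by hand.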
Our processed corpus was subsequently made accessible through the qualiquantitative Context tool (developed by Wåhlstrand Skärström) for investigating linguistic representations, i.e., words and phrases, in the texts. Context aids distant reading and enables the production of quantitative results, such as relative and absolute frequency for queries, as well as close reading of the context of a search query, which enables qualitative and interpretative analysis. For the analysis, the parliamentary texts were grouped by year, irrespective of their subcollection, tokenized into individual words, lemmatized, and queried for the head word, e.g. terror, per year. This filtered volume was then manually curated to remove errors from OCR or lemmatization. This produced data on a 'diachrony' of terror-related words: the frequency of usage of terror-related lexemes (bars in Figure 1) and the innovation of new words, indicating the yearly growth of terror-related terms and the production of compounds. In the analysis, significant individual occurrences (generally the first) were interrogated through closer readings of the parliamentary records.

2.1 Volume of roots and compounds

The search string *terror* generates 1,016 hits in our corpus. The majority (606) are different grammatical forms of the seven terms terror (404), terrorisera (81), terrorism (44), terroristisk (43), terrorist (21), terroriserande (10) and terrorisering (3), all of which may be considered derivations of the root word terror. However, since they are all separately productive, we will consider them roots in their own right. The rest consists of 102 compounds formed from the roots terror (82 different compounds), terrorist (12), terrorism (5), terrorisera (1), terrorisering (1) and terroristisk (1), appearing either as the modifier or the head constituent. Focusing on the two core roots, terrorism was used as early as 1867, the bicameral parliament's first year.
However, close reading shows its 19th-century use to be nonlethal and metaphorical, primarily denoting valterrorism ('electoral terrorism'), i.e. perceived oppressive parliamentary voting procedures. In the 1900s, terrorism came to be used for violent and even lethal activities, first in connection with labor disputes and later, in 1918, with the Finnish Civil War, which also introduced terror as a significant concern for the Swedish Parliament. Usage and new compounds rose sharply from 1918 (from 53 to 542 occurrences, 1918–1970). Notably, terror had only been used on one occasion, in 1903, to refer to the British warship HMS Terror. Furthermore, it is striking that although the use of terrorism preceded that of terror, it is almost absent before 1970 (26 hits 1900–1969), something that we discuss further below. Regarding the most common usage, 13 words occurred 10 times or more (see the table in Figure 1): 5 simple terms and the 8 compound words terrorbalans ('terror balance'), terrorvapen ('weapons of terror'), terroranfall ('terror attacks'), blodsterror ('blood terror'), fackföreningsterror ('trade union terror'), terrorbombning ('terror bombing'), terrorregim ('terror regime') and terrordåd ('terror deed'). Notably, four of these, terrorbalans, terrorvapen, terroranfall and terrorbombning, are used in reference to Cold War nuclear terror. Also, close reading reveals the first occurrence of terrordåd (in 1933) to refer to individuella terrordåd ('individual acts of terror'), i.e. political violence that today would be discussed in terms of terrorism.

2.2 Productivity of compounds

Our results show six roots and one compound (valterrorism) emerging before 1918, and one root (terrorisering) and 102 compounds after 1918.
As far as compound words with terror are concerned, of the 83 in total, eight have more than 10 instances (Figure 1): terrorbalans ('-balance'), terrorvapen ('-weapon'), terroranfall ('-attack'), blodsterror ('blood-'), fackföreningsterror ('trade union-'), terrorbombning ('-bombing'), terrorregim ('-regime'), terrordåd ('-deed'). Furthermore, in line with the rare use of terrorism discussed above, there are only six uses of the four (metaphorical) compounds with -terrorism (val-, bil-, motor-, blockad-) as compared to those with terror and terrorist (12). States' involvement in terror activities is referred to in several of the terror compounds, such as terrorregim (13), terrorvälde (9), polisterror (6), terrorland (2), terrordiktatur (1) and terrorregement (1), as well as others that refer to warfare involving states, such as terrorkrig and terrorbombning. Notably, statsterroristisk (1) and terrorstat (3) were the only compounds with stat ('state'), although the former was metaphorically used to refer to domestic governance issues, as was statsrådsterror (1) (cf. Ängsal et al., 2022, on state and terrorism compounds in the Swedish parliamentary debate 1993–2018). Thus, our results show the associations between states and terror to have a long Swedish history (cf. Fridlund et al., 2020). One can also distinguish periods of compound productivity grounded in domestic and geopolitical contexts, such as arbetsmarknadsterror in 1925–1935, referring to terror by and against labour unions and employers; luftterror in 1936–1940, denoting the threat of wartime aerial bombings against civilian targets; and atomterror in 1948–1963, denoting the nuclear threat during the Cold War.
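The split of *terror* hits into root forms and compounds described in Sections 2.1–2.2 can be approximated with a crude heuristic like the one below. This is an illustrative sketch only; the counts in the paper come from manual curation, and the inflection list here is deliberately simplistic and hypothetical:

```python
ROOTS = ["terroriserande", "terrorisering", "terroristisk", "terrorisera",
         "terrorism", "terrorist", "terror"]  # longest first, so the fullest root wins

# A simplistic, made-up list of Swedish inflectional endings for illustration.
INFLECTIONS = {"", "n", "s", "en", "er", "r", "ns", "ens", "de", "d", "t",
               "a", "at", "ade", "erna"}

def classify(token):
    """Classify a *terror* hit as (root, is_compound).

    Heuristic: a hit counts as a compound if material precedes the root,
    or if what follows the root is not a known inflectional ending.
    """
    t = token.lower()
    for root in ROOTS:
        i = t.find(root)
        if i >= 0:
            tail = t[i + len(root):]
            return root, i > 0 or tail not in INFLECTIONS
    return None, False

print(classify("terrorbalans"))  # ('terror', True)
print(classify("terrorismen"))   # ('terrorism', False)
```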
2.3 The rise of terrorism

Missing from the above discussion are any references to words related to the insurgent violence perpetrated by militant organizations such as the Popular Front for the Liberation of Palestine (PFLP) and the West German Red Army Faction (RAF) that became synonymous with terrorism in the early 1970s (prior to this, other words were sometimes used as labels for similar forms of political violence, including anarkism ('anarchism')). In fact, Stampnitzky argues that the transnational character of this form of violence (skyjacking, hostage taking for political purposes, etc.) and its impact on the modern world order generated a need for a discursive term: 'the concern was with violence out of place – spilling over from local conflicts into the international sphere' (Stampnitzky, 2013, p. 27). The Swedish parliamentary debate provides a clear illustration of this argument when an MP in 1970 claimed that the use of 'terror och motterror' ('terror and counter-terror') could have dire consequences: 'När terrorgrupper tillgriper sådana metoder som kapning eller rent av förstörelse av flygplan med oskyldiga civila passagerare, eller mord på diplomater från utomstående länder, hotas hela det regelsystem för den internationella samlevnaden, som mödosamt byggts upp under lång tid.' ('When terror groups resort to such methods as hijacking or even the destruction of aircraft with innocent civilian passengers, or the murder of diplomats from third countries, the whole system of rules for international coexistence, laboriously built up over a long time, is threatened.') (1970-04-29) The pejorative quality of the word terrorism also made it useful as a rhetorical tool. Later that year, a liberal MP sarcastically commented on the New Left's advocacy of using violent means for political purposes, and in doing so introduced in the Parliament terrorism in its new emerging meaning: 'När den nygamla vänstern vill försvara våldsmetoder får det inte sammanblandas med advokatyr för terrorism. Det skulle låta alltför illa!' ('When the new-old left wants to defend violent methods, this must not be confused with advocacy of terrorism. That would sound far too bad!') (1970-10-29)

3 Conclusion

This paper provides an attempt to understand the development of the closely related words terror and terrorism as manifested in the Swedish parliamentary discourse, 1867–1970.
By applying the tools Sparv and Context, we have explored the development of these two words in isolation and as parts of compounds in parliamentary texts. Combining distant and close reading, we have shown that terror-related words gained traction from 1918 onwards. Furthermore, the uses of the words of interest and their compounds clearly indicate that terror-related activities were, to a large extent, associated with states rather than individuals. At the same time, our results confirm the familiar argument that the word terrorism gained its modern meaning in the early 1970s. On another level, the paper illustrates the potential of combining LT-driven and interpretative approaches for analysing the diachronicity of words in parliamentary corpora.

Acknowledgments

This study is part of The Cultural Imaginary of Terrorism; Terrorism in Swedish Politics; and Things for Living with Terror, funded respectively by the Marcus and Amalia Wallenberg Foundation; VR, RJ and Vitterhetsakademien; and RJ. It is also supported by Swe-Clarin, funded partly by VR (contract no. 2017-00626).

References

Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, & Anne Schumacher. 2016. Sparv: Språkbankens corpus annotation pipeline infrastructure. In The Sixth Swedish Language Technology Conference (SLTC), Umeå University, pages 17–18.

Mats Fridlund, Leif-Jöran Olsson, Daniel Brodén, & Lars Borin. 2019. Trawling for terrorists: A big data analysis of conceptual meanings and contexts in Swedish newspapers, 1780–1926. In HistoInformatics 2019: Proceedings of the 5th International Workshop on Computational History, co-located with the 23rd International Conference on Theory and Practice of Digital Libraries (TPDL 2019), Oslo, pages 30–39. CEUR-WS.

Mats Fridlund, Leif-Jöran Olsson, Daniel Brodén, & Lars Borin. 2020. Trawling the Gulf of Bothnia of news: A big data analysis of the emergence of terrorism in Swedish and Finnish newspapers, 1780–1926.
In Proceedings of the CLARIN Annual Conference, pages 61–65.

Mats Fridlund, Daniel Brodén, T. Jauhiainen, L. Malkki, Leif-Jöran Olsson, & Lars Borin. 2022. Trawling and trolling for terrorists in the digital Gulf of Bothnia: Cross-lingual text mining for the emergence of terrorism in Swedish and Finnish newspapers, 1780–1926. In CLARIN: The Infrastructure for Language Resources, pages 781–802. Berlin: De Gruyter.

Simon Hengchen & Nina Tahmasebi. 2021. A collection of Swedish diachronic word embedding models trained on historical newspaper data. Journal of Open Humanities Data, 7:1–7.

Olof Karsvall & Lars Borin. 2018. SDHK meets NER: Linking place names with medieval charters and historical maps. In Proceedings of DHN 2018, Aachen, CEUR-WS.org, pages 38–50.

Peter Ljunglöf, Niklas Zechner, Luis Nieto Piña, Yvonne Adesam, & Lars Borin. 2019. Assessing the quality of Språkbankens annotations. Technical report, University of Gothenburg, Department of Swedish.

Lisa Stampnitzky. 2013. Disciplining terror: How experts invented 'terrorism'. Cambridge University Press.

Jon Viklund & Lars Borin. 2016. How can big data help us study rhetorical history? In Selected Papers from the CLARIN Annual Conference 2015, pages 79–93. Linköping University Electronic Press.

Magnus P Ängsal, Daniel Brodén, Mats Fridlund, Leif-Jöran Olsson, & Patrik Öhberg. 2022. Linguistic framing of political terror: Distant and close readings of the discourse on terrorism in the Swedish parliament 1993–2018. In Proceedings of CLARIN Annual Conference 2022, Prague, pages 69–72.

UD-based Latvian FrameNet

Normunds Grūzītis, Gunta Nešpore-Bērzkalne, Baiba Saulīte
IMCS, University of Latvia
normunds@ailab.lv, gunta@ailab.lv, baiba@ailab.lv

Abstract

We, students of Lars Borin from the good old NGSLT times, present Latvian FrameNet.
This is part of a larger effort to create a balanced multilayered corpus of Latvian, anchored in cross-lingual state-of-the-art syntactic and semantic representations: Universal Dependencies (UD), FrameNet and PropBank, as well as Abstract Meaning Representation. We have been inspired a lot by the Swedish FrameNet++ (SweFN++), yet there are some differences: we stick to the frame inventory of Berkeley FrameNet, and the FrameNet annotation layer is added on top of a manually curated UD layer. Thus, the annotation of frames, frame elements (FE), and FE spans is guided by the dependency structure of a sentence. We strictly follow a corpus-driven approach: lexical units (LU) in Latvian FrameNet are created only on the basis of the annotated corpus examples. Therefore, in contrast to SweFN++, Latvian FrameNet is definitely not the largest one in terms of LUs, but, to our knowledge, it is the first FrameNet-annotated corpus that has been created as an extension of a UD treebank.

1 Introduction

In the industry-oriented research project "Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian",¹ we have created a balanced text corpus with multilayered annotations (Gruzitis et al., 2018), adopting widely acknowledged and cross-lingually applicable representations: Universal Dependencies (UD) (Nivre et al., 2016), FrameNet (Fillmore et al., 2003), PropBank (Palmer et al., 2005) and Abstract Meaning Representation (AMR) (Banarescu et al., 2013). The UD representation is automatically derived from a more elaborate manually annotated hybrid dependency-constituency representation (Pretkalnina et al., 2018). The FrameNet annotations are manually added, guided by the underlying UD annotations (see Figure 1). Consequently, frame elements (FE) are represented by the root nodes of the respective subtrees instead of text spans; the spans can be easily calculated from the subtrees.
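The span calculation just mentioned, where an FE is annotated on a subtree's root node and its text span is the subtree's yield, can be sketched as follows. The head indices follow CoNLL-U conventions; the toy sentence is only loosely modelled on the Figure 1 example, with reconstructed forms:

```python
# A token is (form, head): `head` is a 1-based index into the sentence,
# 0 = root, as in CoNLL-U. Toy sentence (illustrative only).
sent = [("tautā", 2), ("mīlētais", 3), ("dzejnieks", 4), ("aizgājis", 0)]

def span(sent, fe_root):
    """Yield of the subtree rooted at 1-based index `fe_root` (the FE's text span)."""
    children = {i: [] for i in range(1, len(sent) + 1)}
    for i, (_, head) in enumerate(sent, start=1):
        if head:
            children[head].append(i)
    ids, stack = [], [fe_root]
    while stack:                      # collect all descendants of fe_root
        node = stack.pop()
        ids.append(node)
        stack.extend(children[node])
    return [sent[i - 1][0] for i in sorted(ids)]

print(span(sent, 3))  # ['tautā', 'mīlētais', 'dzejnieks']
```

Annotating only the root node thus keeps the annotation compact while the full span stays recoverable, and span errors surface as tree-structure errors, which is one way the FrameNet layer helps debug the treebank.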
The PropBank layer is automatically derived from the FrameNet and UD annotations (Gruzitis et al., 2020), given a manual mapping from lexical units (LU) in FrameNet to PropBank frames, and a mapping from FrameNet FEs to PropBank semantic roles for the given pair of FrameNet and PropBank frames. Draft AMR graphs are derived from the UD and PropBank layers, as well as auxiliary layers containing named entity and coreference annotation, with the potential to seamlessly integrate the FrameNet frames and FEs into the AMR graphs. The semantically richer FrameNet annotations (compared to PropBank) are also helpful in acquiring more accurate draft AMR graphs, even if FrameNet itself stays behind the scenes. The inspiration to create an integrated multilayer corpus comes from the OntoNotes corpus (Hovy et al., 2006) and the Groningen Meaning Bank (GMB) (Bos et al., 2017). The overall difference from the OntoNotes approach is that we use the UD model at the treebank layer, and we annotate FrameNet frames in addition to the PropBank frames. In fact, FrameNet is the primary frame-semantic representation in our approach.

¹Thank you Lars for your letter of support!

Normunds Grūzītis, Gunta Nešpore-Bērzkalne and Baiba Saulīte. 2022. UD-based Latvian FrameNet. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 49–53. Available under CC BY 4.0

Figure 1: FrameNet annotation on top of a UD tree. Only head nodes are selected while annotating FEs. Literal translation of the sentence: "On Wednesday evening at age 79, passed away the nation's beloved poet Imants Ziedonis". The FE spans can be acquired automatically by traversing the respective subtrees: [trešdienas vakarā]Time, [tautā]Experiencer, [tautā mīlētais dzejnieks Imants Ziedonis]Protagonist. Multi-word LUs are indicated by generic LU tags: mūžībā aizgājis (DEATH) versus mīlētais (EXPERIENCER_FOCUSED_EMOTION).
Another difference is that we aim at whole-sentence semantic annotation at the ultimate AMR layer. This is in some sense similar to the goal of GMB, but GMB uses Discourse Representation Theory instead of AMR. For pragmatic reasons, we use the shallower and more lossy AMR formalism. Our experience in developing semantic parsers and multilingual text generators by combining machine learning and grammar engineering (Gruzitis et al., 2017; Gruzitis & Dannélls, 2017; Borin et al., 2018) has convinced us that FrameNet and AMR both have great potential to establish themselves as powerful and complementary semantic interlinguas, which can be further strengthened and complemented by other multilingual frameworks, like Grammatical Framework (Ranta, 2011). It should be noted that there has been previous work on a domain-specific Latvian FrameNet for a real-life media monitoring use case, focusing on 26 modified Berkeley FrameNet (BFN) frames (Barzdins et al., 2014). The current work, however, aims at a balanced general-purpose BFN-compliant framenet that will cover many frequently used frames and LUs.

2 The corpus

We are aiming at a medium-sized treebank/framebank – around 20,000 sentences annotated at all the layers mentioned in Section 1. Therefore it is crucially important to ensure that the multilayer corpus is balanced not only in terms of text genres and writing styles but also in terms of LUs. Our fundamental design decision is that the text unit is an isolated paragraph. The corpus therefore consists of manually selected paragraphs from many different texts of various types. Representative paragraphs are selected in different proportions from a balanced 10-million-word text corpus. As for the LUs, our goal is to cover at least the 1,500 most frequently occurring verbs and deverbal nouns, calculated from the 10-million-word corpus.
Since the most frequent verbs also tend to be the most polysemous, we expect that the number of LUs will be considerably larger – at least 3,000 LUs. We are aiming at an average of at least 10 annotation sets per LU. Paragraphs to be annotated are selected based on the target words they contain, not randomly, and curators are constantly updated on the current balance or imbalance of the corpus w.r.t. text genres and target word frequencies. Currently, we have acquired more than 20,000 annotation sets, covering more than 500 BFN frames evoked by more than 2,500 LUs.²

3 The FrameNet annotation process

Paragraphs for which the manual treebank annotation is finalized and which have been successfully converted to the UD representation are considered for the FrameNet annotation. Unfinished paragraphs are set aside until the next iteration, since their sentence splitting, tokenization and tree structure can still change considerably. Changes in the tree structure are not a major issue, and the FrameNet annotation process actually helps to spot and eliminate many inconsistencies in the underlying trees. Stable sentence splitting and tokenization, however, are a major requirement for later avoiding issues in merging the different annotation layers.

² https://github.com/LUMII-AILab/FullStack

3.1 The concordance approach

While treebank, named entity and coreference annotations are done paragraph by paragraph and sentence by sentence, we do not find this a productive workflow for annotating semantic frames, especially in the case of the highly abstract FrameNet frames. Instead, we prefer a concordance view, so that the linguist can focus on a target word and its different senses, without constantly switching among different sets of frames. This improves the annotation consistency. As more paragraphs are finalized at the UD layer, they are included in the next concordance queries. The first concordance is processed when there are at least three example sentences available for the target word.
The next concordances are collected and processed at major milestones of the UD layer. The annotated concordances from the first rounds serve as guidelines when annotating the next rounds, thus further improving consistency. A consequence of this approach is that no full-text annotation is intentionally done, although many sentences might become close to fully annotated after merging annotations of the same sentence from different concordances.

3.2 The UD-based annotation

The UD-based approach has a significant consequence: FEs are not annotated as spans of text – annotators select only the head word (node) when annotating an FE. The whole span can be easily calculated automatically by traversing the respective UD subtree. These calculations are not included as part of the data set. This approach not only makes the annotation process simpler and the annotations more consistent, but it also facilitates the training of an automatic semantic role labeler, since it is easier to identify the syntactic head of an FE than the exact span of a string. Still, most FrameNet corpora are annotated in terms of spans, relying on syntactic parsing as a post-processing step.

3.3 Important notes on frame elements

Yet another important decision regarding FEs is to currently focus only on the core elements according to BFN. We have made this decision because of our limited resources. However, we do systematically annotate two non-core elements: Time and Place (as illustrated in Figure 1). In various information extraction use cases (e.g. media monitoring), these two non-core FEs are important. Other non-core elements are annotated occasionally, if they are rather specific to the frame (e.g. non-core indirect objects and specific adverbial modifiers). Regarding null instantiations (NI), we do not annotate FEs that are missing from the sentence.
This is out of the scope of the current project, but the annotation of NI should be considered in follow-up research: (i) since the FrameNet annotation relies on UD, it is an open question how to handle NI – where to attach these annotations; (ii) since Latvian is a highly inflected language, the grammatical subject and object can, to some extent, be omitted in a sentence, with the respective form of the verb compensating for the omission; (iii) in general, it would require Latvian-specific guidelines, but the theoretical foundations are not yet mature for Latvian; it would require more elaborate linguistic research, based on the basic annotated data acquired in the current project; (iv) although NI is highly relevant for lexicographic research, it is not a priority for many practical use cases that require semantic parsing.

3.4 Multi-word lexical units

To deal with target words that are multi-word units, we have introduced an auxiliary annotation layer for multi-word LUs (as illustrated in Figure 1). The head word is still a verb or deverbal noun that evokes a frame, but the other key constituents are indicated as well. Again, note that these constituents may be root nodes of some subtrees – we do not annotate the whole spans. This auxiliary layer is not an ultimate solution for dealing with constructions, but for now it allows us to register such cases and to retrieve them later for more elaborate analysis. Usually these are partially grammaticalized constructions or even idioms that, as a whole, evoke the respective frames. If we considered these verbs in isolation, they would evoke different frames.

3.5 Cross-lingual issues

In order to ensure compliance with BFN and, thus, to maximize the cross-lingual applicability of Latvian FrameNet, we strictly stick to the BFN frame inventory. We avoid defining any Latvian-specific frames. Therefore it is sometimes difficult to select an appropriate BFN frame for a particular sense of a Latvian verb.
It usually happens when:

1. The sense of a Latvian verb is more specific compared to the closest English verb sense or compared to the definition of the closest BFN frame. For instance, for the verb pārdomāt ‘to change one’s mind’ or ‘to rethink’, we do not have a solution yet, since BFN frames related to thinking (Opinion, Cogitation) do not fit this verb sense, and neither does the general Cause_change frame. Similarly, we have not found a good mapping for maldīties ‘to be wrong’ and saņemties ‘to pull oneself together’.

2. The sense of a Latvian verb is more general compared to the closest English verb sense: the sense of an English verb is expressed in Latvian by a phrase (typically, by a verb and a direct object). Examples: lasīt lekciju ‘to lecture’ (‘to give a lecture’), krist ģībonī ‘to faint’ (‘to fall into unconsciousness’), zaudēt samaņu ‘to faint’ (‘to lose consciousness’).

3. The semantic elements differ between the Latvian and English verb senses. For instance, braukt ‘to move using a vehicle’: the sense of the Latvian verb does not specify whether the person is a driver or a passenger (e.g. es braucu uz darbu ‘I go to work (by transport)’ – it is unclear what the role of the person is, and which frame is evoked – Ride_vehicle or Operate_vehicle). In this particular case, we use the frame Use_vehicle, which is a non-lexical frame in English.
There are some options for how to deal with these issues: (i) treating more verb phrases in Latvian as if they were multi-word LUs, even if lexicographers might dispute this; (ii) using a more general BFN frame if possible, i.e., if the direct object of the target verb can be annotated as a core FE (e.g., this would work for ‘to lose consciousness’ but not for ‘to give a lecture’); (iii) some frames are simply missing in BFN, and a global solution would be needed for how to propose and confirm new frames in the BFN frame hierarchy, most likely within the scope of the Multilingual FrameNet initiative (Gilardi & Baker, 2018).

4 Conclusion

In creating Latvian FrameNet, we strictly follow a corpus-driven approach: no LUs are introduced without annotated examples, i.e., we create no LUs based on lexicographic intuition or a common-sense dictionary, only based on corpus evidence. An initial experiment on bootstrapping LUs without corpus evidence did not prove productive: many of those hypotheses are not confirmed by our corpus (at least for now), and vice versa – many LUs were missing. The consecutive treebank and framebank annotation workflow has turned out to be very productive and mutually beneficial. The dependency tree facilitates the annotation of semantic frames and roles, while the frame-semantic analysis of verb valency often reveals various inconsistencies and bugs in the dependency or morphological annotation.

Acknowledgements

This work has received financial support from the European Regional Development Fund under grant agreement No. 1.1.1.1/16/A/219 and is continued within the State Research Programme under grant agreement No. VPP-LETONIKA-2021/1-0006.

References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, & Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking.
In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia.

Guntis Barzdins, Didzis Gosko, Laura Rituma, & Peteris Paikens. 2014. Using C5.0 and exhaustive search for boosting frame-semantic parsing accuracy. In Proceedings of the 9th LREC Conference, pages 4476–4482.

Lars Borin, Dana Dannélls, & Normunds Gruzitis. 2018. Linguistics vs. language technology in constructicon building and use, pages 229–254. John Benjamins.

Johan Bos, Valerio Basile, Kilian Evang, Noortje Venhuizen, & Johannes Bjerva. 2017. The Groningen Meaning Bank. In Nancy Ide & James Pustejovsky, editors, Handbook of Linguistic Annotation, volume 2, pages 463–496. Springer.

Charles J. Fillmore, Christopher R. Johnson, & Miriam R.L. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235–250.

Luca Gilardi & Collin Baker. 2018. Learning to Align across Languages: Toward Multilingual FrameNet. In International FrameNet Workshop 2018: Multilingual FrameNets and Constructicons, Miyazaki.

Normunds Gruzitis & Dana Dannélls. 2017. A multilingual FrameNet-based grammar and lexicon for Controlled Natural Language. Language Resources and Evaluation, 51(1):37–66.

Normunds Gruzitis, Didzis Gosko, & Guntis Barzdins. 2017. RIGOTRIO at SemEval-2017 Task 9: Combining machine learning and grammar engineering for AMR parsing and generation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 924–928.

Normunds Gruzitis, Lauma Pretkalnina, Baiba Saulite, Laura Rituma, Gunta Nespore-Berzkalne, Arturs Znotins, & Peteris Paikens. 2018. Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. In Proceedings of the 11th LREC Conference.

N. Gruzitis, R. Dargis, L. Rituma, G. Nespore-Berzkalne, & B. Saulite. 2020. Deriving a PropBank corpus from parallel FrameNet and UD corpora. In Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet, pages 63–69.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, & Ralph Weischedel. 2006. OntoNotes: The 90% Solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60, New York City.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, & Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the 10th LREC Conference, pages 1659–1666.

Martha Palmer, Daniel Gildea, & Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71–106.

L. Pretkalnina, L. Rituma, & B. Saulite. 2018. Deriving enhanced Universal Dependencies from a hybrid dependency-constituency treebank. In Text, Speech, and Dialogue, volume 11107, pages 95–105. Springer.

Aarne Ranta. 2011. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford.

The rise and fall of grammatical theories in descriptive grammars of the languages of the world

Harald Hammarström
Department of Linguistics and Philology
University of Uppsala, Sweden
harald.hammarstrom@lingfil.uu.se

Abstract

The present study traces the popularity of various grammatical theories in descriptive grammars of the languages of the world. Using the DReaM corpus of grammatical descriptions, we may simply look for the occurrence of simple terms relating to various descriptive theories across time. Even such a relatively elementary investigation shows that explicit interest in theory increases over time, but that specific theories “come and go” relatively quickly.

1 Introduction

The present study traces the popularity of various grammatical theories through time in descriptive grammars of the languages of the world.
As such it intersects various interests of Lars Borin, such as culturomics (Tahmasebi et al., 2015), grammar formalisms (Borin & Saxena, 2004), linguistic typology (Virk et al., 2017) and corpus linguistics (Borin et al., 2012). Writing a descriptive grammar involves the analysis of primary linguistic data (Karlsson, 2005; Chelliah & de Reuse, 2011, 7-24). As has been repeatedly pointed out, some theory is necessary for this task (Rice, 2006; Dryer, 2006), but theories can be more or less explicit and more or less restrictive. Thanks to the appearance of the DReaM corpus (Virk et al., 2020), a collection of digitized grammatical descriptions of the languages of the world, we are now able to shed light on the use of grammatical theories in descriptive work through time. Apart from qualitative remarks in passing (Chelliah & de Reuse, 2011; Sakel & Everett, 2012, 152-158), these trajectories have not been compared in previous work. For the experiments in the present study, we have searched the archive of over 10,911 grammatical descriptions (grammars and grammar sketches) in the DReaM corpus, spanning languages all over the world from 1250 AD to the present (Virk et al., 2020). Even if not explicitly mentioned, the searches have been done inclusive of synonyms, spelling variants, morphological variants and OCR errors (Hammarström et al., 2017).

2 Experiments

The first question is to what extent grammars are explicit about theory at all. For this we may simply search for the term ‘theory’, and its equivalents in a few other European languages, across grammars. For example, the term ‘theory’ occurs no less than 308 times in Saxon (1986)’s English dissertation on Dogrib [dgr], ‘théorie’ occurs 5 times in Alexandre (1966)’s French description of Bulu [bum], and 0 times in von Hagen (1914)’s German description of the same language. Figure 1 shows the proportion of grammars in which the theory-word occurs at least once through the timespan 1850-2010.
As can be seen, there is a trend towards explicit mentions of theory, from approximately 20% in 1850 to over 50% at present (with few appreciable differences between meta-languages). To gauge the popularity of different specific descriptive frameworks we searched through the English subset (7,816 grammars) for a (non-exhaustive) selection of influential theories. Figure 2 shows the proportion of grammars which mention a given theory at least once through the timespan 1900-2010.

Harald Hammarström. 2022. The rise and fall of grammatical theories in descriptive grammars of the languages of the world. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 55–59. Available under CC BY 4.0

Figure 1: The proportion of grammars in which the term ‘theory’, or its equivalent in other languages, occurs at least once through the timespan 1850-2010. The data has been binned into 10-year intervals.

Figure 2: The proportion of grammars which mention a given theory (or an equivalent name or spelling variant) at least once through the timespan 1900-2010. The data has been binned into 10-year intervals.

The term tagmem(ic), originating from Bloomfield (1935), has a few earlier mentions, but the first grammar sketch written in the Tagmemic framework developed by Pike (1954–1960) is Duff (1959)’s sketch of Yanesha’ [ame], cf. Waterhouse (1974, 90-92). A range of Tagmemic grammars produced by members of the Summer Institute of Linguistics followed, peaking in production around 1970, when approximately 1/8 of grammars mention Tagmemics, but the framework has since faded in popularity. An almost identical rise-and-fall curve is manifested by Transformational grammars, associated with Harris (1957) and Chomsky (1957), whose first grammar sketch appears to be Apte (1962)’s sketch of Marathi [mar].
The first explicitly Generative grammar is Sleator (1957)’s description of the English [eng] of Jackson County, Indiana, followed by further Indiana University dissertations. In the beginning, Transformational is almost synonymous with Generative. Although not logically necessary, there is in fact near-full empirical overlap between these terms. After the 1970s, however, the Generative umbrella continues with other exponents. At its peak, 1/6 of grammars were Generative-Transformational. Lexical-Functional Grammar (LFG) was developed in the 1970s (Dalrymple et al., 2019, 1-2), but it is not until Davies (1981)’s grammar of Choctaw [cho] that a grammar is written in this framework. From around the same time is Role and Reference Grammar (RRG), whose first witness is a dissertation on Lakota [lkt] by an author who co-outlined the theory itself in the same year (Van Valin, 1977, 1). A decade later comes Construction Grammar, originally developed for modeling idioms (Fillmore et al., 1988), whose first explicit witness is Watters (1988, 15)’s dissertation on Tlachichilco Tepehua [tpt]. LFG, RRG and Construction Grammar have in common that they are minority theories which nevertheless continue to gain popularity, even into the present, in terms of proportional mentions. Finally, Basic Linguistic Theory, most extensively articulated by Dixon (2010), is first adopted under that name in Hanafi (1997)’s work on Sundanese [sun] (crediting Dixon’s 1996 lecture series in Canberra that underlies the later publication). It has since been the fastest growing and is currently the most popular explicitly mentioned descriptive theory, occurring in approximately 1/7 of grammars. Arguably, many, perhaps most, other descriptions in the past have used a naive version of Basic Linguistic Theory. The new development lies in the explicit use of an extensively developed version thereof.
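The counting behind the proportions in Figures 1 and 2 is elementary and can be sketched as follows. This is a minimal illustration assuming plain-text grammars paired with publication years and a hand-written pattern per theory term; it is not the actual DReaM tooling, which additionally expands synonyms, spelling variants and OCR errors:

```python
# Sketch of decade-binned "mentions at least once" proportions
# (illustrative only, not the DReaM query machinery).
from collections import defaultdict
import re

def proportions_by_decade(grammars, term_pattern):
    """grammars: iterable of (year, text); returns {decade: proportion}."""
    hits, totals = defaultdict(int), defaultdict(int)
    pattern = re.compile(term_pattern, re.IGNORECASE)
    for year, text in grammars:
        decade = (year // 10) * 10          # e.g. 1966 -> 1960
        totals[decade] += 1
        if pattern.search(text):            # "at least once" suffices
            hits[decade] += 1
    return {d: hits[d] / totals[d] for d in totals}

# Tiny invented corpus: one of three 1960s "grammars" mentions the theory-word.
corpus = [(1966, "une théorie du verbe"),
          (1962, "a transformational sketch"),
          (1968, "no such term here")]
print(proportions_by_decade(corpus, r"th[eé]ori"))  # → {1960: 0.3333333333333333}
```

Plotting such a dictionary per theory term, one curve per term, yields figures of the kind shown above.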
Figure 2 also contains a curve for the proportion of grammars that mention at least one of the aforementioned theories at least once. Compared to Figure 1, it exhibits a dipping point in the curve around 1990, i.e., mentions of theory increase steadily, but the specific theories inspected here come in waves.

3 Conclusion

Thanks to the appearance of the DReaM corpus we were able to quantify some trends relating to theoretical frameworks for grammatical description through time. Although shallow, these investigations do reinforce the widely held impression that restrictive descriptive theories “come and go” (Aikhenvald, 2015, 6-7).

References

Alexandra Y. Aikhenvald. 2015. The art of grammar: a practical guide. Oxford: Oxford University Press.

Pierre Alexandre. 1966. Système Verbal et Prédicatif du Bulu, volume 1 of Langues et Littératures de l’Afrique Noire. Paris: Librairie C. Klincksieck.

Mahadeo Laxman Apte. 1962. A sketch of Marathi transformational grammar. Ph.D. thesis, Madison: University of Wisconsin.

Leonard Bloomfield. 1935. Language. London: George Allen & Unwin.

Lars Borin & Anju Saxena. 2004. Grammar, incorporated. In P. J. Henrichsen, editor, CALL for the Nordic languages, pages 125–145. Samfundslitteratur, Frederiksberg.

Lars Borin, Markus Forsberg, & Johan Roxendal. 2012. Korp – the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 474–478. Istanbul: ELRA.

Shobhana L. Chelliah & Willem J. de Reuse. 2011. Handbook of Descriptive Linguistic Fieldwork. Dordrecht: Springer.

Noam Chomsky. 1957. Syntactic Structures. The Hague: Mouton.

Mary Dalrymple, John J. Lowe, & Louise Mycock. 2019. The Oxford Reference Guide to Lexical Functional Grammar. Oxford: Oxford University Press.

William D. Davies. 1981. Choctaw Clause Structure. Ph.D. thesis, University of California at San Diego.

R.M.W. Dixon. 2010. Basic Linguistic Theory. Oxford: OUP. 2 vols.

Matthew S. Dryer. 2006.
Descriptive theories, explanatory theories, and basic linguistic theory. In Felix Ameka, Alan Dench, & Nicholas Evans, editors, Catching Language: Issues in Grammar Writing, pages 207–234. Berlin: Mouton de Gruyter.

Martha Duff. 1959. Amuesha (Arawak) syntax I: simple sentence types / Sintaxe amuexa (Arawak) I: sentenças do tipo simples. In Publicações do Museu Nacional, volume 1 of Série Lingüistica Especial, pages 172–237. Rio de Janeiro: Museu Nacional.

Charles J. Fillmore, Paul Kay, & Mary Catherine O’Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, 64(3):501–538.

Harald Hammarström, Shafqat Mumtaz Virk, & Markus Forsberg. 2017. Poor man’s OCR post-correction: Unsupervised recognition of variant spelling applied to a multilingual document collection. In Proceedings of the Digital Access to Textual Cultural Heritage (DATeCH) conference, pages 71–75. Göttingen: ACM.

Nurachman Hanafi. 1997. A typological study of Sundanese. Ph.D. thesis, La Trobe University.

Zellig S. Harris. 1957. Co-occurrence and transformation in linguistic structure. Language, XXXIII:283–340.

Fred Karlsson. 2005. Nature and methodology of grammar writing. SKY Journal of Linguistics, 18:341–356.

Kenneth L. Pike. 1954–1960. Language in relation to a unified theory of the structure of human behavior. Santa Ana, California: Summer Institute of Linguistics. 3 vols.

Keren Rice. 2006. Let the language tell its story? The role of linguistic theory in writing grammars. In Felix Ameka, Alan Dench, & Nicholas Evans, editors, Catching Language: Issues in Grammar Writing, pages 235–268. Berlin: Mouton de Gruyter.

Jeanette Sakel & Daniel L. Everett. 2012. Linguistic Fieldwork: A Student Guide. Cambridge: Cambridge University Press.

Leslie Saxon. 1986. The Syntax of Pronouns in Dogrib: Some Theoretical Consequences. Ph.D. thesis, University of California at San Diego.

Mary Dorothea Sleator. 1957.
Phonology and morphology of an American English dialect. Ph.D. thesis, Indiana University.

Nina Tahmasebi, Lars Borin, Gabriele Capannini, Devdatt Dubhashi, Peter Exner, Markus Forsberg, Gerhard Gossen, Fredrik Johansson, Richard Johansson, Mikael Kågebäck, Olof Mogren, Pierre Nugues, & Thomas Risse. 2015. Visions and open challenges for a knowledge-based culturomics. International Journal on Digital Libraries, 15(2-4):169–187.

Robert D. Van Valin. 1977. Aspects of Lakhota Syntax: A Study of Lakhota (Teton Dakota) Syntax and its Implications for Universal Grammar. Ph.D. thesis, University of California, Berkeley.

Shafqat Virk, Lars Borin, Anju Saxena, & Harald Hammarström. 2017. Automatic extraction of typological linguistic features from descriptive grammars. In Kamil Ekstein & Václav Matousek, editors, Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings. Cham: Springer International Publishing.

Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, & Søren Wichmann. 2020. The DReaM corpus: A multilingual annotated corpus of grammars for the world’s languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 871–877. Marseille: European Language Resources Association.

Gunther Tronje von Hagen. 1914. Lehrbuch der Bulu Sprache. Berlin: Druck und Verlag von Gebr. Rodetzki Hofbuchhandlung.

Viola G. Waterhouse. 1974. The history and development of tagmemics. The Hague: Mouton.

James Kenneth Watters. 1988. Topics in Tepehua Grammar. Ph.D. thesis, University of California at Berkeley.
Coveting your neighbor’s wife: Using lexical neighborhoods in substitution-based word sense disambiguation

Richard Johansson
Department of Computer Science and Engineering
University of Gothenburg and Chalmers University of Technology, Sweden
richard.johansson@gu.se

Abstract

We explore a simple approach to word sense disambiguation for the case where a graph-structured lexicon of word sense identifiers is available, but no definitions or annotated training examples. The key idea is to consider the neighborhood in a lexical graph to generate a set of potential substitutes of the target word, which can then be compared to a set of substitutes suggested by a language model for a given context. We applied the proposed method to the SALDO lexicon for Swedish and used a BERT model to propose contextual substitutes. The system was evaluated on sense-annotated corpora, and despite its simplicity we see a strong improvement over previously proposed models for unsupervised SALDO-based word sense disambiguation.

1 Introduction

Probabilistic language models estimate the probability of a word occurring in a given context. This means that for an observed occurrence of a word, a language model can suggest other words – substitutes – that could potentially have occurred instead. With a high-quality language model, the set of potential substitutes reflects the sense of the word in that specific context. This intuition suggests a simple mechanism for the task of word sense disambiguation (WSD), where our goal is to link each occurrence to an item in a fixed sense inventory defined by a lexicon: assuming that the lexicon allows us to generate a set of potential substitutes for each sense, we can simply compare each of these lists to the one we got from the language model. To disambiguate, we then select the lexicon sense whose substitute set is most similar to the language model’s set of substitutes.
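The selection mechanism just described can be illustrated with a toy sketch. The weighted sets below are invented stand-ins: real contextual weights would come from a language model and real sense sets from a lexicon, so this is not output of the actual system:

```python
# Toy sketch of substitution-based sense selection: pick the sense whose
# substitute set is most similar (by cosine) to the language model's set.
# All weights and neighbor words below are invented for illustration.
import math

def cosine(a, b):
    """Cosine similarity between two sparse word -> weight vectors (dicts)."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(contextual, sense_sets):
    """Return the sense id whose substitute set best matches the context."""
    return max(sense_sets, key=lambda s: cosine(contextual, sense_sets[s]))

# Substitutes a language model might propose for 'ämne' in a chemistry context,
# and made-up lexicon-derived sets for two senses.
contextual = {"innehåll": 0.3, "gift": 0.25, "område": 0.2,
              "medel": 0.15, "föremål": 0.1}
senses = {"ämne..1": {"gift": 1.0, "materia": 1.0},   # 'substance'
          "ämne..2": {"tema": 1.0, "fråga": 1.0}}     # 'topic'
print(disambiguate(contextual, senses))  # → ämne..1
```

Only the overlapping word gift contributes to the dot product here, so the ‘substance’ sense wins; with many substitutes on each side, the weighting makes the comparison less brittle than plain set overlap.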
How can we use a lexicon to generate a set of potential substitutes of a given sense? This depends on what information the lexicon represents and how it is structured. In this work, we assume that the lexicon is graph-structured and that proximity in the graph corresponds to substitutability; this assumption allows us to generate a set of potential substitutes of a given sense by considering its neighborhood in the graph. To exemplify, let us assume that we are given the following two occurrences of the Swedish word ämne and that we want to associate them with a sense in the SALDO lexicon (Borin et al., 2013):

(1) Detta ämne är frätande. ‘This substance is corrosive.’
(2) Detta ämne kommer att diskuteras senare. ‘This topic will be discussed later.’

For the first case, the five most probable substitutes suggested by a BERT model are innehåll ‘content’, gift ‘poison’, område ‘area’, medel ‘agent’, föremål ‘object’; for the second case, they are område ‘area’, problem ‘problem’, språk ‘language’, tema ‘theme’, förslag ‘proposal’. We then consider the neighborhoods in the lexicon graph. SALDO defines four senses of ämne. Sense 1 corresponds to ‘substance’ and its immediate neighborhood overlaps with the substitute set for the first example: there is an edge in the SALDO graph between sense 1 of ämne and sense 1 of gift, so we can link the first occurrence to sense 1. Similarly, sense 2 in SALDO corresponds to ‘topic’, and there is an edge between this sense and sense 1 of tema, allowing us to disambiguate the second occurrence as well.

Richard Johansson. 2022. Coveting your neighbor’s wife: Using lexical neighborhoods in substitution-based word sense disambiguation. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 61–66. Available under CC BY 4.0

Figure 1: Fragments of SALDO neighborhoods for two of the senses of ämne: (a) neighborhood of ämne..1; (b) neighborhood of ämne..2.
Primary descriptor edges are drawn as solid arrows and secondary descriptor edges as dashed arrows.

2 The SALDO lexicon

The SALDO lexicon (Borin et al., 2013) defines a large sense inventory for Swedish words. While a number of other large-scale lexical resources for Swedish have been developed, SALDO is the largest open resource. It is an extended version of the SAL lexicon (Lönngren, 1989; Borin, 2005) and has been used as a pivot lexicon to define mappings between several lexical-semantic resources in Swedish (Borin, 2010), for instance in the Swedish FrameNet++ project (Borin et al., 2010). Borin & Forsberg (2009) discuss the conceptual differences between SALDO and WordNet (Fellbaum, 1998). A major difference between these resources is that SALDO tends to use a more coarse-grained sense inventory than WordNet. Another fundamental difference is that SALDO does not define typed lexical-semantic relations (e.g. synonymy, is-a, hyponymy) between word senses but instead relies on the notion of association (Borin et al., 2013). Association can correspond to several types of lexical-semantic relationships: in many cases, an associated sense can be a synonym or hyperonym, but in other cases it can be e.g. a meronym or stand in a predicate–argument relationship. While each sense could in principle be in an association relationship with many other senses, SALDO explicitly encodes a relationship between each sense and its primary descriptor (PD): an associated sense that has a more primitive meaning. A few additional relationships are encoded as secondary descriptors. SALDO includes no other lexical-semantic information apart from these relations, such as sense definitions or contextual examples. Figure 1 shows the neighborhoods in the SALDO graph around the two senses of ämne discussed in the introduction.

3 Previous work

Disambiguation systems are implemented in different ways depending on what resources are available.
For WordNet-based WSD in English, most systems use supervised learning because moderately large annotated datasets are available. WordNet is also a fairly rich resource and includes definitions, glosses, as well as several types of labeled sense-to-sense relations. In contrast, SALDO-based WSD is more challenging because of the small quantity of available annotated data and the sparse information in the lexicon. For this reason, most of the WSD systems using SALDO rely on the structure of the lexicon graph only, sometimes in combination with representations learned from unannotated text.

Johansson & Nieto Piña (2015b) proposed a method to align SALDO senses with a word embedding model; this approach naturally leads to a disambiguation mechanism (Johansson & Nieto Piña, 2015a). A tool using this disambiguation method is now integrated in the Sparv annotation pipeline (Borin et al., 2016). Nieto Piña & Johansson (2017) used a graph-based regularizer to train word and sense embeddings jointly. Purely graph-based WSD approaches requiring no corpora include graph embeddings using random walks (Nieto Piña & Johansson, 2016b) and personalized PageRank (Agirre & Soroa, 2009).

Nieto Piña & Johansson (2016a) evaluated several WSD systems on all SALDO-annotated corpora that were available at the time. The system by Johansson & Nieto Piña (2015b) was the most effective of those using no training data, but a comparison with a supervised system (on a limited set of target lemmas for which annotated data was available) showed that the unsupervised systems performed relatively poorly.

The idea of disambiguating word senses by using language models to suggest potential substitutes was first proposed by Başkaya et al. (2013), who applied this approach for WordNet-based WSD as well as for lexicon-free word sense induction (WSI).
Subsequent work has mostly focused on WSI: for instance, Amrami & Goldberg (2018) applied a pair of language models to generate substitute sets for WSI. The same group later used a BERT model for substitute set generation (Amrami & Goldberg, 2019), and this approach is the state of the art in WordNet-based WSI for English as of 2022 (Eyal et al., 2022). The pre-training of BERT (Devlin et al., 2019) involves (among other things) training a masked language model (MLM) that tries to predict the identity of a hidden word in a given context, and this aligns perfectly with our goals, since a substitute set can then be generated simply by applying the MLM.

4 Selecting a SALDO sense for an ambiguous word

Assuming that we are given a context, the position of a word to disambiguate, and a set of SALDO senses to select from, we compute a weighted contextual set of substitute words (§4.1) as well as a weighted word set based on the SALDO neighborhood of each sense (§4.2). We then compute the cosine similarity between the contextual set and each of the SALDO-based sets and select the highest-scoring sense.

4.1 Proposing contextual substitutes

We follow the most recent work in substitution-based WSI and apply the MLM of a BERT model. We used the Swedish BERT model published by the Swedish Royal Library (Malmsten et al., 2020). Following Eyal et al. (2022), the MLM is applied in a straightforward manner without masking or modifying the text. We compute the probability distribution at the target position, select the 200 top-scoring items, and exclude inflections of the original target word. The set of potential substitute tokens is weighted proportionally to the probabilities assigned by the MLM.

While the application of BERT is quite straightforward, the probability distributions are affected by the word piece tokenization. For instance, if a token is followed by a suffix word piece (e.g.
##ar), the MLM will assign high probabilities mainly to prefixes likely to be followed by this suffix. This probably causes the substitute sets to be of poorer quality for less frequent words, and it precludes the use of the approach for the disambiguation of multiword expressions. In this work, we simply removed suffix word pieces (starting with ##) from the set of substitutes; a more systematic approach could be explored in later work.

4.2 Extracting neighborhoods from SALDO

We use the neighborhood extraction approach proposed by Nieto Piña & Johansson (2017). For a given SALDO sense, we extract its immediate neighbors in the SALDO graph, following primary and secondary descriptor edges in both directions. Since our goal is to produce a list of words that could potentially be substituted, we only include senses of words of the same grammatical category as the original sense. We repeat the process and add parents, children, and siblings to the set until it has a size of at least 16. Finally, we use the morphological lexicon of SALDO to map every sense to a set of inflected forms, so that e.g. gift..1 results in gift, giftet, . . . , giftens. The items are assigned weights that depend on the distance in the SALDO graph.

5 Experiments

The largest sense-annotated resource for Swedish was developed in the SemTag project (Järborg, 1999); it covers most of the Stockholm–Umeå corpus (Ejerhed et al., 1992). However, this resource does not use SALDO to define its senses, although SALDO has imported some senses from the SemTag lexicon. The Swedish lexical sample of the SENSEVAL-2 shared task (Kokkinakis et al., 2001) used a subset of the SemTag resource consisting of annotations for 40 ambiguous lemmas. The senses for these lemmas were manually mapped to SALDO by Nieto Piña & Johansson (2016a).
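The pipeline of section 4 — breadth-first neighborhood expansion with distance-based weights, followed by cosine comparison against the contextual substitute set — can be sketched as below. This is a simplified reconstruction: the `decay` weighting and `min_size` parameter stand in for the distance-based weights and the size-16 expansion described above, senses are mapped to bare lemmas rather than to all of their inflected forms, and the same-part-of-speech filter is omitted.

```python
import math
from collections import deque

def weighted_neighborhood(graph, sense, min_size=16, decay=0.5):
    """Expand a sense's neighborhood breadth-first over descriptor
    edges, weighting each lemma by decay**distance, until min_size
    words are collected or the graph is exhausted."""
    weights = {}
    seen = {sense}
    queue = deque([(sense, 0)])
    while queue and len(weights) < min_size:
        node, d = queue.popleft()
        if node != sense:
            lemma = node.split("..")[0]        # sense id -> lemma
            weights.setdefault(lemma, decay ** d)
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return weights

def cosine(a, b):
    """Cosine similarity between two sparse weighted word sets."""
    dot = sum(w * b.get(x, 0.0) for x, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_sense(contextual, graph, candidates, min_size=16):
    """Pick the candidate sense whose weighted neighborhood set is
    most similar to the contextual substitute set."""
    return max(candidates,
               key=lambda s: cosine(contextual,
                                    weighted_neighborhood(graph, s, min_size)))

# Toy graph and contextual substitute weights (illustrative values only):
graph = {
    "ämne..1": ["gift..1", "medel..1"],
    "ämne..2": ["tema..1"],
    "gift..1": ["ämne..1"],
    "medel..1": ["ämne..1"],
    "tema..1": ["ämne..2"],
}
contextual = {"innehåll": 0.30, "gift": 0.25, "medel": 0.15}
print(select_sense(contextual, graph, ["ämne..1", "ämne..2"], min_size=4))
# → ämne..1
```

In the full method, the contextual weights come from the MLM probabilities of §4.1 and the graph is the SALDO descriptor network.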
Since SALDO uses a coarser sense division than SemTag, three of the lemmas were no longer ambiguous after this lexicon mapping, and they were removed from the dataset. The only running-text corpus annotated with SALDO senses is Eukalyptus (Johansson et al., 2016), which includes texts from eight different domains.

Method                  SENSEVAL-2   Eukalyptus
Substitutes             0.6675       0.7020
J & NP (2015)           0.4976       –
Random baseline         0.3557       0.4094
Lowest-sense baseline   0.4952       0.6580
Supervised (BoW)        0.8033       –
Supervised (BERT)       0.9209       –

Table 1: Disambiguation results on the test sets for the different methods.

The instances were preprocessed using the Sparv pipeline (Borin et al., 2016). For each word, the pipeline proposes a set of possible SALDO senses, based on the automatically determined morphological analysis and lemmatization. The sense disambiguator chooses one of the candidates from this set. Unambiguous words are excluded from the experiment, which means that the practical accuracy is higher than what we report in the next section, since the majority of the words are unambiguous. We also exclude cases where the annotated sense is a non-compositional reading of a multi-word expression (e.g. på örat intended as 'drunk', not as 'on the ear') or a compositional reading of a compound. After this preprocessing, the SENSEVAL-2 sample consists of a test set of 1,366 instances and a training set of 7,790 instances, and the Eukalyptus set of 12,434 instances.

5.1 Results

We evaluated the substitute-based approach proposed in this paper and compared it to a number of trivial and nontrivial baselines. Table 1 shows the disambiguation accuracies on the two test sets. The accuracies are macro-averaged over the 37 lemmas for SENSEVAL-2 and micro-averaged for Eukalyptus.
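Since the table mixes the two averaging schemes, a small sketch may help clarify the difference between them. The lemmas and instance counts below are invented for illustration only.

```python
from collections import defaultdict

def micro_accuracy(instances):
    """Accuracy over all instances pooled together; frequent lemmas
    dominate the score."""
    correct = sum(1 for _, gold, pred in instances if gold == pred)
    return correct / len(instances)

def macro_accuracy(instances):
    """Mean of per-lemma accuracies; every lemma counts equally,
    regardless of how many instances it has."""
    by_lemma = defaultdict(list)
    for lemma, gold, pred in instances:
        by_lemma[lemma].append(gold == pred)
    return sum(sum(v) / len(v) for v in by_lemma.values()) / len(by_lemma)

# A frequent lemma answered well and a rare lemma answered badly
# pull the two averages apart:
data = [("kalla", 1, 1)] * 9 + [("kalla", 1, 2)] + [("flytta", 1, 2)]
print(round(micro_accuracy(data), 3))  # → 0.818
print(round(macro_accuracy(data), 3))  # → 0.45
```

Macro-averaging fits the SENSEVAL-2 lexical sample, where each of the 37 lemmas is a separate evaluation target; micro-averaging fits the running-text Eukalyptus setting, where the lemma distribution is given by the corpus.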
The most meaningful comparison is with the method by Johansson & Nieto Piña (2015a), which is included in Sparv: this system uses a similar setup, with a combination of the SALDO graph and a representation model trained in an unsupervised fashion. As we can see, the substitute-based method performs much better on the SENSEVAL-2 test set. Both methods outperform two trivial baselines: random selection, and selecting the sense with the lowest numerical identifier. The substitute-based method also outperforms the lowest-sense baseline on the Eukalyptus set.

For SENSEVAL-2, we also evaluate two straightforward supervised approaches that learn from annotated training examples: a linear SVM using a bag-of-words representation, and an MLP on a BERT representation. Both were implemented as "word experts" that use one classifier per base form. All graph-based methods are strongly outperformed by the supervised models. In practice, the supervised approach cannot be applied to Eukalyptus because of the Zipfian distribution of the lemmas to disambiguate.

6 Discussion

The proposed method works surprisingly well compared to the baselines, despite its simplicity. The method is also quite cheap: in the implementation described here, we have only used the graph-based neighborhood, although in the general case it may be possible to exploit other lexical-semantic information to generate more accurate substitute sets. No annotated training examples are needed. While the performance is better than that of previous purely graph-based WSD approaches using SALDO, it is much lower than that of supervised models in a lexical sample setting. Obviously, a supervised word expert approach is more difficult to apply in a running-text setting, e.g. in Eukalyptus. Another important practical consideration is the flexibility of the substitute-based method: if we add a new sense to the lexicon and update the edges accordingly, we can immediately use the new sense in the disambiguator.
The method can therefore be argued to be applicable in an interactive fashion.

This is a first attempt, and we see potential for a more careful treatment of the graph-based substitute set, the contextual substitutes, and the way these sets are compared. The whole idea hinges on being able to use the lexical resource to suggest potential substitutes. For SALDO, this works less well in some cases where the neighborhood structure does not correspond well to substitutability. Words referring to professions are one such case; cf. the discussion by Johansson (2014). More generally, we may want to develop methods that align token representations from a language model with a representation of the graph. One might use an embedding of the SALDO graph, either a purely graph-based embedding (Nieto Piña & Johansson, 2016b) or one based on a combination of the graph and a corpus (Johansson & Nieto Piña, 2015b; Nieto Piña & Johansson, 2017). It may then be possible to build a mapping from the BERT-based representation into the space of the embedded graph.

Acknowledgements

This work builds on resources developed in projects running over several decades at Språkbanken and Uppsala University: the development of the SALDO lexicon and the sense-annotated corpora in the SemTag and Eukalyptus projects. I was funded by the projects Interpreting and Grounding Pre-trained Representations for NLP and Representation Learning for Conversational AI, both funded by WASP.

References

Eneko Agirre & Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 33–41, Athens.

Asaf Amrami & Yoav Goldberg. 2018. Word Sense Induction with Neural biLM and Symmetric Patterns. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4860–4867, Brussels. Association for Computational Linguistics.

Asaf Amrami & Yoav Goldberg. 2019.
Towards better substitution-based word sense induction. arXiv preprint 1905.12598, https://arxiv.org/pdf/1905.12598.pdf.

Osman Başkaya, Enis Sert, Volkan Cirik, & Deniz Yuret. 2013. AI-KU: Using Substitute Vectors and Co-Occurrence Modeling For Word Sense Induction and Disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 300–306, Atlanta, Georgia. Association for Computational Linguistics.

Lars Borin & Markus Forsberg. 2009. All in the family: A comparison of SALDO and WordNet. In Proceedings of the Nodalida 2009 Workshop on WordNets and other Lexical Semantic Resources – between Lexical Semantics, Lexicography, Terminology and Formal Ontologies. NEALT Proceedings Series, volume 7.

Lars Borin, Dana Dannélls, Markus Forsberg, Maria Toporowska Gronostaj, & Dimitrios Kokkinakis. 2010. The Past Meets the Present in the Swedish FrameNet++. In Proceedings of EURALEX.

Lars Borin, Markus Forsberg, & Lennart Lönngren. 2013. SALDO: a touch of yin to WordNet's yang. Language Resources and Evaluation, 47(4):1191–1211.

Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, & Anne Schumacher. 2016. Sparv: Språkbanken's corpus annotation pipeline infrastructure. In Swedish Language Technology Conference, Umeå.

Lars Borin. 2005. Mannen är faderns mormor: Svenskt associationslexikon reinkarnerat. LexicoNordica, 12:39–54.

Lars Borin. 2010. Med Zipf mot framtiden – en integrerad lexikonresurs för svensk språkteknologi. LexicoNordica, 17.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
Association for Computational Linguistics.

Eva Ejerhed, Gunnel Källgren, Ola Wennstedt, & Magnus Åström. 1992. The linguistic annotation system of the Stockholm-Umeå corpus project – description and guidelines. Technical report, Department of Linguistics, Umeå University.

Matan Eyal, Shoval Sadde, Hillel Taub-Tabib, & Yoav Goldberg. 2022. Large Scale Substitution-based Word Sense Induction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4738–4752, Dublin. Association for Computational Linguistics.

Christiane Fellbaum, editor. 1998. WordNet: An electronic lexical database. MIT Press.

Richard Johansson & Luis Nieto Piña. 2015a. Combining Relational and Distributional Knowledge for Word Sense Disambiguation. In Proceedings of the 20th Nordic Conference of Computational Linguistics, pages 69–78, Vilnius. Linköping University Electronic Press.

Richard Johansson & Luis Nieto Piña. 2015b. Embedding a Semantic Network in a Word Space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1428–1433, Denver.

Richard Johansson, Yvonne Adesam, Gerlof Bouma, & Karin Hedberg. 2016. A Multi-domain Corpus of Swedish Word Sense Annotation. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 3019–3022, Portorož.

Richard Johansson. 2014. Automatic Expansion of the Swedish FrameNet Lexicon. Constructions and Frames, 6(1):92–113.

Jerker Järborg. 1999. Lexikon i konfrontation. Technical report, University of Gothenburg. Research Reports from the Department of Swedish, Språkdata, GU-ISS-99-6.

Dimitrios Kokkinakis, Jerker Järborg, & Yvonne Cederholm. 2001. SENSEVAL-2: The Swedish Framework. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 45–48, Toulouse.

Lennart Lönngren. 1989.
Svenskt associationslexikon: Rapport från ett projekt inom datorstödd lexikografi. Technical report, Uppsala University.

Martin Malmsten, Love Börjeson, & Chris Haffenden. 2020. Playing with words at the National Library of Sweden – Making a Swedish BERT. arXiv preprint 2007.01658, https://arxiv.org/pdf/2007.01658.pdf.

Luis Nieto Piña & Richard Johansson. 2016a. Benchmarking word sense disambiguation systems for Swedish. In Swedish Language Technology Conference, Umeå.

Luis Nieto Piña & Richard Johansson. 2016b. Embedding Senses for Efficient Graph-based Word Sense Disambiguation. In Proceedings of the 2016 Workshop on Graph-based Methods for Natural Language Processing, pages 2710–2715, San Diego.

Luis Nieto Piña & Richard Johansson. 2017. Training Word Sense Embeddings with Lexicon-based Regularization. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 284–294, Taipei.

Linguistics concepts as semantic frames

Per Klang, Department of Scandinavian Languages, Uppsala University, Uppsala, Sweden, per.klang@nordiska.uu.se
Shafqat Mumtaz Virk, Språkbanken Text, University of Gothenburg, Gothenburg, Sweden, shafqat.virk@svenska.gu.se

Abstract

The topic of this paper is the representation of linguistic concepts as semantic frames. It presents the general structure of a lexical resource for the linguistic domain, here referred to as LingFN. This resource contains semantic frames for the linguistic terms and concepts found in traditional grammars. In addition, the paper illustrates how LingFN can be used for the natural language processing task of automatically extracting typological information from descriptive grammars, which is otherwise carried out manually at much greater cost.

1 Introduction

A grammatical description is a document that describes various structural aspects of a natural language.
There are approximately 7 000 recorded natural languages (see ethnologue.com), and written grammatical information is available for around 4 000 of these (see glottolog.org for details). At present, there are ongoing endeavors to digitize this information so that modern computational techniques can be exploited to analyze it and, ultimately, the languages themselves.

Digitization can be seen from two different perspectives. The first regards the preparation of corpora, which involves scanning and OCRing available printed resources on natural languages; the second covers structured digital representations of natural languages and linguistic concepts. This paper's concern is with the latter.

Over centuries of philosophical research, linguists have developed a variety of notions on how to discern and describe the phonological, morphological, and structural attributes of language. Indeed, there is a reasonable amount of work on the digital representation of lexicons of natural languages, and consequently there exist many online dictionaries and related lexical resources (Warwick, 1988; Fellbaum, 1998). But, to the best of our knowledge, only limited work has been done on the digital representation of grammatical aspects and grammars of natural languages.

This paper is an attempt to fill this gap by using frame semantics – a theory of meaning in language (Fillmore, 1976) – to develop special structures (semantic frames) for representing the concepts of linguistics. Using a semiautomatic methodology, a set of semantic frames has been developed and connected via various types of relations, resulting in a network of semantic frames called LingFN, a framenet for the linguistics domain. LingFN is expected to be useful for various NLP tasks, especially the automatic extraction of typological information from descriptive grammars, which is otherwise accomplished by labour-intensive manual methods.
The paper begins with a brief presentation of frame semantics and its background (section 2), followed by a description of LingFN and the idea of representing linguistic concepts as semantic frames (section 3). A brief outline of some possible applications of LingFN follows (section 4). The paper concludes with a summary (section 5).

Per Klang and Shafqat Mumtaz Virk. 2022. Linguistics concepts as semantic frames. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 67–71. Available under CC BY 4.0

2 Background

The general idea of frame semantics is that words are understood with respect to the situation that they evoke in the mind of the speaker. The mapping between a word and a situation forms a conceptual structure known as a semantic frame, which is a script-like description of a prototypical situation, event, or object, along with its participants, known as frame elements, or FEs for short (Ruppenhofer et al., 2016). The ideas of frame semantics were first put to use in a lexico-semantic resource for English called FrameNet, also known as Berkeley FrameNet (BFN), which contains a network of semantic frames for general language (Baker et al., 1998). This resource has successfully been used for automatic shallow semantic parsing (Gildea & Jurafsky, 2002), which is employed in several natural language processing tasks such as information extraction (Surdeanu et al., 2003) and question answering (Shen & Lapata, 2007), to name a few. The utility of FrameNet has also led to the development of framenets for a number of other languages that build upon the BFN model, all of which have contributed to the understanding of the semantic characteristics of each specific language. Although general framenets have proven to be useful for many tasks, they have also been criticized for their limited coverage.
To cope with this problem, domain-specific framenets have been constructed as complements to the corresponding general-language framenets in order to improve the performance of NLP tools for specific domains such as medicine (Borin et al., 2007), soccer (http://www.kicktionary.de/), and soccer-related tourism (Torrent et al., 2014). The upcoming section presents the overall structure of another domain-specific framenet, the previously mentioned LingFN, whose content consists of semantic frames for linguistic terms and concepts used in traditional linguistic grammars (e.g. inflection, agreement, affixation, etc.). While a part of LingFN follows the general structure of event frames in BFN, the other part is structured after an ontology of linguistic terms known as GOLD.1

3 Linguistics concepts as semantic frames

LingFN is a framenet for the linguistics domain which has been created after the grammar (and language) descriptions found in the 1.3 MW subcorpus of The Linguistic Survey of India (LSI) (Grierson, 1903–1927).2 Although the lexicographic work in LingFN assumes a slightly less complex structure than BFN, it mostly draws upon the same model as described in the BFN manual (Ruppenhofer et al., 2016), where words, or lexical units (LUs), with similar meaning are bundled up under the same frame. Like BFN, it holds a network of semantic frames (Baker et al., 1998), but the major difference between the two networks lies in that the former covers specific linguistic terms and concepts while the latter covers more general concepts. Detailed descriptions of the development of LingFN are found elsewhere (Malm et al., 2018; Virk et al., 2022).

The general design of LingFN is fairly simple, and it is illustrated in Figure 1 below. It has two frame types: event frames, which represent eventful types of scenes (or concepts), and filler frames, which follow the general structure of linguistic terms in GOLD.
It also contains two frame-to-frame links: an inheritance link, which forms a hierarchical IS-A relation between frames, and a used-by link, which connects the filler frames to any event frame in which they may appear.

Let us now consider an example of how semantic frames for the linguistics domain may be used to aid the identification of linguistic information in descriptive grammars. This is the case of the word borrow, which has a specific meaning in linguistics. Consider below the difference between the meaning of borrow from the BFN BORROWING frame in (1a), as opposed to borrow in (1b) from LSI.

(1) a. Does my Mum borrow money off you? (BFN)
    b. [. . .] the Musalmān dialect borrows freely from the Persian vocabulary. (LSI)

The BORROWING frame holds information about a situation involving a number of frame elements such as a BORROWER that takes possession of a THEME belonging to a LENDER, under the tacit agreement that the THEME is to be returned after a DURATION of time. Example (1a) is clearly an instance of the BORROWING frame, since it is conceptually necessary to imagine the return of the borrowed item given a duration of time. This is not the case in (1b). A word that has been borrowed by a language cannot be returned, since it was never really borrowed in the first place.

Figure 1: The basic structure of LingFN

1 http://linguistics-ontology.org/
2 LSI presents a comprehensive survey of the languages spoken in South Asia. It was conducted in the late nineteenth and the early twentieth century by the British government, under the supervision of George A. Grierson. The survey resulted in a detailed report comprising 19 volumes of around 9 500 pages in total. The survey covered 723 linguistic varieties, for each of which it provides a grammatical sketch, a core word list, and text specimens.
The pseudo-character of the linguistic borrowing frame, which we may refer to as PSEUDO-BORROWING, can be further illustrated by means of reductio ad absurdum in the comparison of (2a–b) below.

(2) a. [. . .] Paula had let her borrow the boat for a few hours [. . .] (BFN)
    b. This principle of formation is borrowed from Magahī *until next spring / *temporarily. (Constructed)

The PSEUDO-BORROWING frame can thus be delimited from the BFN BORROWING frame in that the former does not occur with an FE of DURATION. However, this observation is of limited use, since it can only identify the cases that are not PSEUDO-BORROWING. Still, as we shall see, this limitation may be remedied by the filler frames and their used-by links to frames like PSEUDO-BORROWING.

The filler frames typically capture information about LUs that appear as frame elements in event frames, such as the type, material, or color of the LU referent. Annotated examples from LSI are given below, marking the two filler frames GENETIC TAXON and LINGUISTIC DATA STRUCTURE.

(3) a. [. . .] [LANGUAGE_VARIETY the Musalmān] [LU dialect] borrows freely from the Persian vocabulary. (GENETIC_TAXON)
    b. [. . .] the Musalmān dialect borrows freely from the [LANGUAGE_VARIETY Persian] [LU vocabulary]. (LINGUISTIC_DATA_STRUCTURE)

The information in the filler frames can aid in the disambiguation of polysemous verbs like borrow. The linking of filler frames to event frames with used-by links allows the FEs of the event frame to be checked against the LUs listed in the LingFN filler frames. On the one hand, if the content of an FE of some event frame is not listed in the set of filler frames, the event frame probably belongs to the general domain. On the other hand, if the content of the FE is listed in the set of filler frames, as illustrated by the sentence in the table below, this suggests that the event frame is specific to the linguistic domain.
Sentence:     The Musalman dialect   borrows   from the Persian vocabulary
Event FEs:    Borrower               LU        Lender
Filler FEs:   Language variety  LU             Language variety  LU

In its current state, LingFN houses nearly 100 frames with 325 used-by links, about 360 lexical units, and more than 2 800 annotated sentences from LSI; see the statistics below.

Frame type      Frames   Used-by links   Lexical units   Annotated examples
Event frames         5             171              25                1 858
Filler frames       94             154             335                  948
Total               99             325             360                2 806

The dual structure of the event and filler frames, coupled with the inheritance and used-by links, provides a simple architecture that may be exploited for applications aimed at the automatic extraction of information from grammatical descriptions. These applications form the subject of the next section.

4 Applications of linguistic domain semantic frames

LingFN was developed particularly for the extraction of typological linguistic information from descriptive grammars.3 Traditionally, the extraction of typological information from descriptive grammars is done manually as part of the development of typological databases (e.g. https://wals.info/). While the manual curation of such databases takes a lot of time and effort, the usefulness of the end result is often cited to justify the development cost. A reasonable alternative, which has not received extensive consideration in the literature, is to extract the typological information automatically. LingFN has proven to be helpful in this respect, and the general process is described next.

In the first stage, linguistic concepts were gathered from the grammatical descriptions and annotated as semantic frames. These annotations were used to train a parser, which was subsequently used to annotate additional grammatical descriptions. The parser annotations were then converted to typological feature values to be used for linguistic analysis.
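The used-by check from section 3 — testing whether the fillers of an event frame's FEs are listed in the filler-frame inventory — can be sketched as a simple lookup. The frame names follow the paper, but the LU inventories and the all-fillers criterion are simplifying assumptions made for this illustration.

```python
# Toy filler-frame inventory; the LU sets are invented for illustration.
filler_frames = {
    "GENETIC_TAXON": {"dialect", "language", "variety"},
    "LINGUISTIC_DATA_STRUCTURE": {"vocabulary", "lexicon", "word"},
}

def linguistic_domain(fe_fillers):
    """Treat an event frame as linguistic-domain if every frame-element
    filler is listed in some filler frame (a simplified version of the
    used-by check described in section 3)."""
    lus = set().union(*filler_frames.values())
    return all(f in lus for f in fe_fillers)

# "The Musalman dialect borrows from the Persian vocabulary."
print(linguistic_domain(["dialect", "vocabulary"]))  # → True
# "Does my Mum borrow money off you?"
print(linguistic_domain(["Mum", "money"]))           # → False
```

In the full resource, this lookup is mediated by the used-by links between filler frames and event frames rather than by a single pooled LU set.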
For more details on the annotation task, the parser development, and the conversion from semantic parses to typological features, see Virk et al. (2017) and Virk et al. (2019).

5 Summary

This paper has proposed the use of frame semantics for structured digital representations of traditional linguistic concepts. The concepts have been gathered from LSI and the GOLD ontology of linguistic terms and rendered as semantic frames in LingFN, a framenet for the domain of linguistics. It has further been shown how this resource can be used for the automatic extraction of information from descriptive grammars.

6 Acknowledgements

The work presented here has been financially supported by the Swedish Research Council through its funding of the projects South Asia as a linguistic area? Exploring big-data methods in areal and genetic linguistics (2015–2019; contract 421-2014-969) and Swe-Clarin (2014–2018; contract 821-2013-2003), the Swedish Foundation for International Cooperation in Research and Higher Education (STINT) through its Swedish-Brazilian research collaboration program (2014–2019; contract BR2014-5860), and the University of Gothenburg, its Faculty of Arts and its Department of Swedish, through their truly long-term support of the Språkbanken research infrastructure.

3 There is, however, nothing that prevents it from being used for other purposes.

References

Collin F. Baker, Charles J. Fillmore, & John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL/COLING 1998, pages 86–90, Montreal. ACL.

Lars Borin, Maria Toporowska Gronostaj, & Dimitrios Kokkinakis. 2007. Medical frames as target and tool. In FRAME 2007: Building Frame Semantics resources for Scandinavian and Baltic languages (Nodalida 2007 workshop proceedings), pages 11–18, Tartu. NEALT.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Charles J. Fillmore. 1976. Frame semantics and the nature of language.
Annals of the New York Academy of Sciences, 280(1):20–32.

Daniel Gildea & Daniel Jurafsky. 2002. Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3):245–288.

George A. Grierson. 1903–1927. A Linguistic Survey of India, volume I–XI. Government of India, Central Publication Branch, Calcutta.

Per Malm, Shafqat Mumtaz Virk, Lars Borin, & Anju Saxena. 2018. LingFN: Towards a Framenet for the Linguistics Domain. In Proceedings of the IFNW 2018 Workshop on Multilingual FrameNets and Constructicons at LREC 2018, Miyazaki. ELRA.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Collin F. Baker, & Jan Scheffczyk. 2016. FrameNet II: Extended Theory and Practice.

Dan Shen & Mirella Lapata. 2007. Using Semantic Roles to Improve Question Answering. In Proceedings of EMNLP-CoNLL 2007, pages 12–21, Prague. ACL.

Mihai Surdeanu, Sanda Harabagiu, John Williams, & Paul Aarseth. 2003. Using Predicate-Argument Structures for Information Extraction. In Proceedings of ACL 2003, pages 8–15, Sapporo. ACL.

Tiago Timponi Torrent, Maria Margarida Martins Salomão, Ely Edison da Silva Matos, Maucha Andrade Gamonal, Júlia Gonçalves, Bruno Pereira de Souza, Daniela Simões Gomes, & Simone Rodrigues Peron-Corrêa. 2014. Multilingual lexicographic annotation for domain-specific electronic dictionaries: The Copa 2014 FrameNet Brasil project. Constructions and Frames, 6(1):73–91.

Shafqat Virk, Lars Borin, Anju Saxena, & Harald Hammarström. 2017. Automatic extraction of typological linguistic features from descriptive grammars. In Proceedings of TSD 2017. Springer.

Shafqat Mumtaz Virk, Azam Sheikh Muhammad, Lars Borin, Muhammad Irfan Aslam, Saania Iqbal, & Nazia Khurram. 2019. Exploiting Frame-Semantics and Frame-Semantic Parsing for Automatic Extraction of Typological Information from Descriptive Grammars of Natural Languages.
In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pages 1247–1256, Varna. INCOMA Ltd. Shafqat Virk, Per Malm, Lars Borin, & Anju Saxena. 2022. LingFN: A FrameNet for the Linguistics Domain. In Proceedings of CICLing 2019. Susan Warwick. 1988. Automated lexical resources in Europe: A survey. Lexicographica, 4:93–129. Deep learning models as theories of linguistic knowledge Shalom Lappin Centre for Linguistic Theory and Studies in Probability Department of Philosophy, Linguistics, and Theory of Science University of Gothenburg shalom.lappin@gu.se Abstract Recent work in deep learning has made remarkable progress across a wide range of complex tasks in artificial intelligence in general, and in natural language processing in particular. It has yielded solutions to problems that were recalcitrant to traditional rule-based methods over many decades. It is worth considering the possibility that deep learning systems are not simply engineering procedures, but viable models of linguistic representation and language acquisition. When exploring this option it is important to keep in mind the serious limitations of these systems. In this paper I briefly explore these questions, and I offer some tentative conclusions. 1 Introduction Over the past 70 years most linguists have used formal grammars and model theories to encode linguistic knowledge. Similarly, until 2000 rule systems and knowledge representation logics dominated many areas of artificial intelligence. The rise of deep learning in AI in general, and in natural language processing (NLP) in particular, has largely displaced symbolic methods in computational linguistics. Baroni (2021) observes that most theoretical linguists have taken little, if any, notice of the role of deep neural networks (DNNs) in NLP. He argues that DNNs should be considered as possible alternative theories of linguistic representation.
In Lappin (2021) I suggest a similar view. In this paper I will briefly address some of the arguments for this approach, and several of its implications. 2 Alternative approaches to representation and learning A wealth of important work on individual languages, and on cross-linguistic patterns, has been done within the formal grammar framework. Many (most?) advocates of these formalisms have assumed that they correspond to the way in which humans encode knowledge of their language, at some level of representation. Some theorists have claimed that the class of formal grammars that generate the set of natural languages cannot be learned on the basis of the primary linguistic data available to children, through domain general learning procedures. They have concluded that humans bring strong domain specific learning biases to bear on the language acquisition task. They have identified these biases with a Universal Grammar (UG), construed as a schematic cognitive structure that restricts the set of hypotheses available to the language learner, concerning the design of a possible grammar. DNNs are not rule-based algebraic systems. They do not, in general, apply grammatical rules or constraints as part of their training regimen, nor do they extract symbolic representations from data. The success of deep learning in NLP raises the question of whether a non-symbolic, domain general inductive system may offer a viable model of the way in which humans acquire and represent linguistic knowledge. It is possible to enrich a neural network with syntactic or semantic learning biases in a variety of ways. It is important to consider such systems to see whether these enrichments improve performance on a given set of tasks. If a DNN learns to solve a task that requires interesting types of linguistic information, at a level approaching human performance, within tractable amounts of time and data, then it provides a demonstration of how humans could, in principle, acquire this knowledge. [Shalom Lappin. 2022. Deep learning models as theories of linguistic knowledge. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 73–78. Available under CC BY 4.0] To the extent that this learning uses domain general inductive mechanisms, it shows that strong linguistic biases may not be required for this element of language acquisition. Success in deep learning also indicates that humans could encode the knowledge required to perform a given task through distributed, non-symbolic representations. While formal grammars offer elegant formalisms for encoding syntactic information (as do model theories for semantic content), they have generally not provided robust, wide coverage systems for handling linguistically interesting NLP applications. By contrast, deep learning has achieved remarkable success over a wide variety of tasks. They include, among others, identifying subject-verb agreement (Linzen et al., 2016; Bernardy & Lappin, 2017; Gulordava et al., 2018), machine translation (Bahdanau et al., 2015), image description (He et al., 2020), and prediction of sentence acceptability (Lau et al., 2020). Formal grammars and model theories pose serious problems of learnability.1 By contrast, recent work in deep learning has shown that relatively domain general inductive learning devices can learn to solve cognitively interesting NLP tasks within tractable limits of time and data. 3 Hybrid models: How much do linguistic theories add to NLP? It is natural to assume that integrating linguistic theories into deep learning models will improve the performance of DNNs, and add robust wide coverage to formal grammars. While this view is intuitively appealing, its correctness is far from obvious. There are (at least) two ways in which hybrid systems of this kind can be implemented.
First, it is possible to train a DNN to identify tree structures in data through design and training (Tai et al., 2015; Socher et al., 2011; Bowman et al., 2016; Yogatama et al., 2017; Choi et al., 2018; Maillard et al., 2019; Kuncoro et al., 2018; Kuncoro et al., 2019; Kuncoro et al., 2020). Second, we can enrich the training data with syntactic and semantic feature markers, rendering these markers part of the information that the DNN learns to generalise over (Ek et al., 2019). Both sorts of model have been tested on a variety of NLP tasks, including sentiment analysis, natural language inference (NLI), and sentence acceptability prediction. In general, they have yielded only very small improvements, if any, over the performance of their non-enriched counterparts. Ek et al. (2019) report that the addition of syntactic and semantic tags to the training data actually degraded the level of accuracy of the LSTM tested on this task.2 The fact that hybrid DNNs incorporating symbolic linguistic information have not yielded particularly interesting results to date does not show that the approach is misguided. More successful versions of such models may yet emerge from future work. However, these results do indicate that DNNs learn in ways that do not straightforwardly accommodate symbolic representations and rule systems. It is worth considering the possibility that the way in which humans acquire natural languages, and other forms of knowledge, may be closer to deep learning than to classical paradigms of grammar induction. 4 The limitations of deep learning The remarkable success of deep learning in NLP has tended to obscure the serious limitations of DNNs. There are at least three of these worth noting here. First, DNNs require very large quantities of training data in order to achieve reasonable performance, on most complex NLP tasks. 
This renders a comparison with human learning moot, given that children do not require this amount of data to achieve linguistic knowledge. Use of reinforcement learning (François-Lavet et al., 2018), and multimodal encoder-decoder models (Hill et al., 2020a) may go some way to alleviating the need for large training data sets. Much work remains to be done on this problem. The second difficulty is that DNNs generally lack transparency concerning their mode of operations, and the generalisations that they extract from data. It is often unclear how they learn the generalisations that they achieve. This undermines their usefulness as explanatory models of learning. I will take this issue up further in Section 5. [Footnote 1: See Clark & Lappin (2011) for discussion of complexity and learnability in grammar induction.] [Footnote 2: See Lappin (2021) for detailed discussion of these hybrid models, and additional references.] Finally, deep learning systems frequently encounter problems in generalising to domains beyond those on which they are trained. The error rate of a DNN generally increases in proportion to the distance between the examples of its test set and those of its training data (Lake & Baroni, 2018). Also, adversarial testing has indicated that small perturbations in the test set can produce dramatic changes in accuracy, particularly in cognitively complex tasks like NLI (Talman & Chatzikyriakidis, 2019; Talman et al., 2021). I will briefly suggest a possible way of dealing with this limitation in Section 6. 5 The opacity problem The absence of transparency in deep learning systems has become substantial with the move to large multi-headed transformers, many of which are non-directional or bidirectional, like BERT (Devlin et al., 2019). The primary source of opacity in DNNs is their use of non-linear functions, such as the sigmoid and the hyperbolic tangent, to compute the output states of units from their inputs.
These functions cause the vectors that each layer of a DNN produces to be, in the general case, non-compositional, because the mapping from input vectors to output vectors is not a homomorphism.3 Bernardy & Lappin (2022) and Bernardy & Lappin (in press) suggest a solution to the opacity problem. They propose Unitary Evolutionary Recurrent Neural Networks (URNs) for NLP. An URN uses unitary matrix word embeddings and simple linear operations on them to process linguistic input.4 They do not contain non-linear functions, and so they are strictly compositional in the operations through which they combine word embeddings to obtain output matrices. URNs correspond to quantum circuits. No information is lost in the course of processing input. Earlier states of the network are fully recoverable. Although URNs achieve promising results in recognising long distance agreement patterns in artificial languages, they have some way to go before attaining the scale, level of performance, and coverage of current state-of-the-art non-transparent DNNs. 6 Extending the scope of generalisation As we observed in Section 4, deep learning continues to contend with difficulties in generalisation and overfitting in many areas of NLP. The work of Hill et al. (2020a) and Hill et al. (2020b) on multimodal semantic learning points to an encouraging direction for solving this problem. They train agents in simulated visual environments to recognise new natural language commands, and to respond to them appropriately in these environments. They use a multimodal encoder-decoder architecture for this task, which extends the models applied to machine translation, to achieve dynamic visual grounding of language. The systems that they describe perform more substantial learning and generalisation beyond the training data than previous models that are limited to linguistic input.
These systems come closer to approximating human learning, which is grounded in a multimodal environment, than DNNs trained exclusively on linguistic data. When input from visual and other modalities is coordinated with linguistic training data, it may be possible to reduce the size of the linguistic training set. It is necessary to recognise that information from these modalities is indispensable to human language acquisition, and it should be counted as part of the data that supports the acquisition process. 7 Conclusions and future work The robustness and wide coverage that deep learning systems are demonstrating across a wide range of cognitively interesting NLP tasks suggest that they are more than engineering devices for producing effective language technology. It is worth taking them seriously as possible models of linguistic representation and language acquisition. Attempts to integrate symbolic elements of linguistic theory into these systems have not yielded dramatic improvements in their performance to date. This situation may change in the future. At this point, these results suggest that linguistic theories may not be particularly useful for NLP tasks, as DNNs process information in a way that does not easily accommodate symbolic encoding. DNNs exhibit serious limitations in large data requirements, opacity, and range of generalisation. Work on reinforcement learning in NLP, on transparent models, and on multimodal encoder-decoder architectures offers promising approaches to solving these problems. [Footnote 3: A mapping f : A → B from group A to group B, with group operation ·, is a homomorphism iff for every v_i, v_j ∈ A, f(v_i · v_j) = f(v_i) · f(v_j), where v_i · v_j ∈ A and f(v_i) · f(v_j) ∈ B.] [Footnote 4: A complex square matrix U is unitary if its conjugate transpose U* is identical to its inverse U^(-1). The word embeddings of URNs contain only real numbers, and so they are orthogonal matrices.]
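The unitarity property behind the transparency of URNs (Section 5) can be illustrated numerically. The following NumPy sketch is our own minimal illustration, not code from Bernardy & Lappin: it checks that a real orthogonal matrix satisfies U*U = I, preserves vector norms (so no information is lost), and allows an earlier state to be recovered exactly by applying the conjugate transpose.

```python
import numpy as np

# A 2D rotation is the simplest real-valued unitary (orthogonal) matrix,
# of the kind used as URN word embeddings (footnote 4).
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Unitarity: the conjugate transpose equals the inverse, so U* U = I.
assert np.allclose(U.conj().T @ U, np.eye(2))

# Consequence: applying U preserves vector norms, hence no information
# is lost in processing the input state v.
v = np.array([1.0, 2.0])
w = U @ v
assert np.isclose(np.linalg.norm(w), np.linalg.norm(v))

# Earlier states are fully recoverable: applying U* inverts the step.
recovered = U.conj().T @ w
assert np.allclose(recovered, v)
```

By contrast, a non-linear activation such as the sigmoid has no such inverse in general, which is the source of the opacity discussed in Section 5.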
Considerably more research is needed on how to deal with noise in linguistic input without destabilisation. This is particularly important for tasks involving NLI, dialogue management, and text interpretation. Closer cooperation among computational linguists, cognitive psychologists, and neuroscientists is needed in order to assess the similarities and the differences between the ways in which DNNs and humans process linguistic information. Deep learning systems can offer indirect insights into learning and representation by demonstrating what sort of knowledge can be acquired by such a system on the basis of certain kinds of training, with learning biases of a general, or a domain specific type, within given limits of time and data. However, to determine to what extent, if any, these devices actually correspond to human learning and representation it is necessary to study the latter in comparison with the performance of DNNs. Only comparative research of this kind can illuminate whether deep learning provides plausible models of human linguistic knowledge. Acknowledgements It is a pleasure and an honour to contribute a paper to Lars Borin’s festschrift. Lars has made major contributions to linguistics and NLP in Sweden. The legacy of his work has had a lasting impact in these areas, and it will continue to do so in the future. He is a good friend and a fine colleague. I look forward to many more years of fruitful scientific exchange and friendship with him. My research reported in this paper was supported by grant 2014-39 from the Swedish Research Council, which funds the Centre for Linguistic Theory and Studies in Probability (CLASP) in the Department of Philosophy, Linguistics, and Theory of Science at the University of Gothenburg. References Dzmitry Bahdanau, Kyunghyun Cho, & Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ArXiv, abs/1409.0473. Marco Baroni. 2021.
On the proper role of linguistically-oriented deep net analysis in linguistic theorizing. ArXiv, abs/2106.08694 (to appear in Lappin & Bernardy, in press). Jean-Philippe Bernardy & Shalom Lappin. 2017. Using deep neural networks to learn syntactic agreement. Linguistic Issues In Language Technology, 15:1–15. Jean-Philippe Bernardy & Shalom Lappin. 2022. Assessing the unitary RNN as an end-to-end compositional model of syntax. In End-to-End Compositional Models of Vector-Based Semantics 2022, Electronic Proceedings in Theoretical Computer Science 366.4, pages 9–22. Jean-Philippe Bernardy & Shalom Lappin. in press. Unitary recurrent networks: Algebraic and linear structures for syntax. In Shalom Lappin & Jean-Philippe Bernardy, editors, Algebraic Structures in Natural Language. CRC Press, Taylor & Francis, Boca Raton, London, New York. Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, & Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1466–1477, Berlin. Association for Computational Linguistics. Jihun Choi, Kang Min Yoo, & Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In AAAI Conference on Artificial Intelligence. Alexander Clark & Shalom Lappin. 2011. Linguistic Nativism and the Poverty of the Stimulus. Wiley-Blackwell, Malden, MA and Oxford. Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. Adam Ek, Jean-Philippe Bernardy, & Shalom Lappin. 2019.
Language modeling with syntactic and semantic representation for sentence acceptability predictions. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 76–85, Turku. Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, & Joelle Pineau. 2018. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-4):219–354. Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, & Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics. Sen He, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn, & Nicolas Pugeault. 2020. Image captioning through image transformer. ArXiv, pages 1–17. Felix Hill, Andrew K. Lampinen, Rosalia Schneider, Stephen Clark, Matthew Botvinick, James L. McClelland, & Adam Santoro. 2020a. Environmental drivers of systematicity and generalization in a situated agent. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa. Felix Hill, Olivier Tieleman, Tamara von Glehn, Nathaniel Wong, Hamza Merzic, & Stephen Clark. 2020b. Grounded language learning fast and slow. ArXiv. Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, & Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne. Association for Computational Linguistics. Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, & Phil Blunsom. 2019. Scalable syntax-aware language models using knowledge distillation.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3472–3484, Florence. Association for Computational Linguistics. Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, & Phil Blunsom. 2020. Syntactic structure distillation pretraining for bidirectional encoders. ArXiv. Brenden Lake & Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Jennifer Dy & Andreas Krause, editors, Proceedings of Machine Learning Research, volume 80, pages 2873–2882, Stockholm. PMLR. Shalom Lappin & Jean-Philippe Bernardy, editors. in press. Algebraic Structures in Natural Language. CRC Press, Taylor & Francis, Boca Raton, London, New York. Shalom Lappin. 2021. Deep Learning and Linguistic Representation. CRC Press, Taylor & Francis, Boca Raton, London, New York. Jey Han Lau, Carlos Armendariz, Shalom Lappin, Matthew Purver, & Chang Shu. 2020. How furiously can colorless green ideas sleep? Sentence acceptability in context. Transactions of the Association for Computational Linguistics, 8:296–310. Tal Linzen, Emmanuel Dupoux, & Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535. Jean Maillard, Stephen Clark, & Dani Yogatama. 2019. Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs. Natural Language Engineering, 25(4):433–449. Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, & Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151–161, Edinburgh. Association for Computational Linguistics. Kai Sheng Tai, Richard Socher, & Christopher D. Manning. 2015.
Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing. Association for Computational Linguistics. Aarne Talman & Stergios Chatzikyriakidis. 2019. Testing the generalization power of neural network models across NLI benchmarks. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 85–94, Florence. Association for Computational Linguistics. Aarne Talman, Marianna Apidianaki, Stergios Chatzikyriakidis, & Jörg Tiedemann. 2021. NLI data sanity check: Assessing the effect of data corruption on model performance. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 276–287, Reykjavik (Online). Linköping University Electronic Press. Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, & Wang Ling. 2017. Learning to compose words into sentences with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, Conference Track Proceedings. The case for Språkbanken Dialog Staffan Larsson Christine Howes Eleni Gregoromichelaki Department of Philosophy, Linguistics and Theory of Science University of Gothenburg, Sweden staffan.larsson@ling.gu.se christine.howes@gu.se eleni.gregoromichelaki@gu.se Abstract We argue that the National Language Bank of Sweden should be extended with an additional infrastructure supporting research on linguistic interaction. Our main argument is that dialogue is not (just) text or speech, and consequently, that studying dialogue requires dialogue-specific infrastructure.
1 Introduction One of Lars Borin’s main achievements is, of course, his very successful development and management of Språkbanken, now a part of the National Language Bank of Sweden under the name Språkbanken Text. Following his lead and inspiring pioneering example, we would like to take this opportunity to argue the case for a Swedish Dialogue Bank – Språkbanken Dialog. 2 Språkbanken Quoting from the homepage of the National Language Bank of Sweden, “The purpose of The National Language Bank of Sweden is to develop a national e-infrastructure supporting research in language technology, linguistics and other fields of study where research is conducted based on language data.” Currently, the National Language Bank of Sweden has three parts – Språkbanken Text, Språkbanken Tal and Språkbanken Sam. These all provide important infrastructures for researchers in various fields. Språkbanken Text contains text corpora that can be searched for occurrences of words and phrases, including longitudinal data. Språkbanken Tal contains (or will contain when it is launched in 2023) recorded speech aligned with text for use in research and development of speech technologies. Språkbanken Sam contains text and some speech recordings focusing on (1) official multilingual texts and terminology for research in official communication and social conditions, and (2) folk narratives, as well as other text and speech material from the dialect and folklore archives. The National Language Bank of Sweden is a significant achievement and a valuable resource for language technology purposes. However, a considerable lacuna, in our view, remains: these resources do not provide a comprehensive collection of spoken, written, and/or multimodal interactions in Swedish (and/or minority languages) that are available and searchable in the way that is needed to explore the interactive aspects of language use and structure. This is what we argue is still needed.
3 Dialogue It is now widely accepted that human conversation does not consist of a sequence of sentences simply placed one after the other. There are specific phenomena that only become visible at the level of dialogical interaction, for example, so-called “grounding processes” (Clark, 1996), turn-taking (Sacks et al., 1974), repair (Schegloff et al., 1977), and multimodal input and output (Bavelas & Gerwing, 2007), which are the features of dialogue that make it so much easier to process and engage in than monologue. [Staffan Larsson, Christine Howes and Eleni Gregoromichelaki. 2022. The case for Språkbanken Dialog. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 79–82. Available under CC BY 4.0] On the other hand, phenomena which have been considered sentence-internal and requiring specialised syntactic/semantic mechanisms can be seen under a new, more illuminating, light when considered in the context of conversation. For example, phenomena like anaphora, ellipsis, syntactic/semantic dependencies, and speech act recognition/production can extend across turns and participants (see (1), below). In fact, it can be shown that such puzzling phenomena rely more crucially on interactive mechanisms for their resolution than individual processing capacities, a case of ‘computational offloading’ to the social environment (Gregoromichelaki, 2017). Across linguistics, psychology, philosophy, and cognitive science, it is now recognised that the primary ecological niche of language use is face-to-face interaction. Therefore, it has now become common to talk about the human ‘interaction engine’ (Levinson, 2020) to refer to the evolutionarily and culturally shaped linguistic skills and social capacities that are involved in language processing and general action coordination. Formal grammars, computational implementations, and linguistic/psycholinguistic theories now attempt to model formally and test experimentally these interactive processes to explain human linguistic cognition and behaviour (Ginzburg, 2012; Gregoromichelaki et al., 2020; Healey et al., 2018; Cooper, 2022). In the field of language technology and AI, it is also becoming a familiar theme to address human interaction and conversation as the source of invaluable data. Many current architectures take advantage of training data from dialogue and multimodal corpora, whether annotated or not, and there is a recognition in recent work that large-scale language models – even those which make use of visual data – lack sufficient training data of conversational strategies such as repair (Lemon, 2022). Additionally, models increasingly seek to leverage interactive processes with human-in-the-loop teaching and supervision as a means of extending the capabilities of Large Language Models and artificial agents like social robots, developing their trustworthiness, reliability, and alignment with human values. As an illustration, let us look at an example of a dialogue with the sort of annotations we envision for Språkbanken Dialog: (1) STANLEY: Louis, I[ref:STANLEY] just didn’t[NPI-licensor] think [[assertion; change of turn: split utterance]] LOUIS: you[ref:STANLEY]’d ever[NPI] hear from me[ref:LOUIS]? [[continuation & clarification & confirmation request & quotation]] [BBC Transcripts, Dancing on the Edge, Episode 5, example from: Gregoromichelaki (2017)] Here the annotation needs to indicate the dialogue-act multifunctionality of subsentential turns. We also need to have information about the dependency between the Negative Polarity Item (NPI) ever and its licensor n’t that occur in different turns by different speakers even though no single surface string can be syntactically reconstructed.
In confirmation of this, it needs to be indicated in the annotation how the incremental change of speaker within a quotative clause reporting the first speaker’s mental state (‘Stanley_speaker did not think ∥ that Stanley_addressee will hear from Louis_speaker’) results in incremental switches in the interpretation of indexicals. This evidence of dependencies crossing turns and speakers renders untenable any simple analysis of the shared string as a joined surface syntactic form with respect to the semantics: (2) #Louis I just didn’t think you’d ever hear from me. In addition, it is demonstrated that grammatical analyses need to incorporate semantic and, crucially, pragmatic factors, e.g., turn-taking in dialogue, in order to provide a coherent and unified analysis of syntactic/semantic phenomena. Moreover, understanding both human psychological processes and the functioning of end-to-end models and AI architectures with respect to linguistic behaviour requires becoming aware of, and modelling, such interactions of what have been standardly taken as separate modules of linguistic/non-linguistic knowledge in standard monological accounts. 4 Språkbanken Dialog With this in mind, let us try to explain in more detail why Språkbanken Dialog is needed, and how we envision it. Språkbanken Dialog is (would be) a large collection of linguistic interactions, including video recordings of face-to-face interactions, audio recordings of spoken interactions, transcribed interactions (aligned with the source video or sound recordings), and written interactions taken e.g. from social media and chat applications. It is possible to view, annotate and analyse individual interactions across multiple turns – something not currently offered by any Språkbanken resources. It is also possible to relate individual interactions to each other, e.g. temporally, spatially, or with respect to the speakers involved (while keeping to GDPR restrictions).
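To make the kind of cross-turn annotation and querying discussed above concrete, here is a purely hypothetical sketch in Python. The record layout, field names, and act labels are our own invention for illustration, not an existing Språkbanken schema; it encodes the split utterance in example (1) and then runs the sort of cross-turn query that a text-oriented corpus interface cannot express.

```python
# Hypothetical dialogue record for example (1); the schema is invented
# here for illustration only.
dialogue = {
    "source": "BBC Transcripts, Dancing on the Edge, Episode 5",
    "turns": [
        {"speaker": "STANLEY",
         "tokens": ["Louis", "I", "just", "didn't", "think"],
         "token_annotations": {"I": {"ref": "STANLEY"},
                               "didn't": {"role": "NPI-licensor"}},
         "dialogue_acts": ["assertion"],
         "transition": "split-utterance"},
        {"speaker": "LOUIS",
         "tokens": ["you'd", "ever", "hear", "from", "me"],
         "token_annotations": {"you'd": {"ref": "STANLEY"},
                               "ever": {"role": "NPI", "licensor_turn": 0},
                               "me": {"ref": "LOUIS"}},
         "dialogue_acts": ["continuation", "clarification",
                           "confirmation-request", "quotation"]},
    ],
}

# Cross-turn query: find NPIs whose licensor occurs in a different turn,
# returning (token, turn index, licensing turn index).
cross_turn_npis = [
    (token, i, ann["licensor_turn"])
    for i, turn in enumerate(dialogue["turns"])
    for token, ann in turn["token_annotations"].items()
    if ann.get("role") == "NPI" and ann.get("licensor_turn") != i
]
print(cross_turn_npis)  # [('ever', 1, 0)]
```

The point of the sketch is architectural: because turns are first-class objects with speaker and act information, dependencies that span turns and speakers become directly queryable, rather than being flattened into a single text stream.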
What about overlap with existing Språkbanken resources? It is true that other Språkbanken resources already contain linguistic interactions. In fact, as far as possible, such material should also be included in Språkbanken Dialog. However, none of the existing resources offer the possibility of adequately exploring the interactive aspects of these dialogues. In Språkbanken Text, interactions are treated as any other text, and it is not possible to see full interactions across several turns, nor to annotate or analyse them. The argument for Språkbanken Dialog rests on the fact that linguistic interaction is not reducible to, or analysable in terms of, individual words or phrases. So maybe Språkbanken Dialog could just be a different interface to existing Språkbanken resources? Such a thing would certainly be useful, but there are also reasons to include additional resources not covered by other Språkbanken infrastructure. Currently, linguistic interactions are collected by researchers and students working on dialogue in the course of their research activities. This data can be in the form of text, audio, video, or some combination thereof. Currently, a lot of these resources never become available to other researchers. We believe that Språkbanken Dialog could offer infrastructure that would enable and encourage low-effort sharing, annotation and analysis of dialogue data (including multimodal data), thus boosting research on linguistic interaction in Swedish and other languages. 5 Future work We leave it to future work to fund, organise and implement Språkbanken Dialog. In this, we hope to follow Lars Borin’s inspiring example. Acknowledgements We acknowledge support from the Swedish Research Council VR project 2014-39 for the establishment of the Centre for Linguistic Theory and Studies in Probability (CLASP) at the University of Gothenburg. References Janet B. Bavelas & Jennifer Gerwing. 2007.
Conversational hand gestures and facial displays in face-to-face dialogue. Frontiers of social psychology: Social communication, pages 283–307.
Herbert H. Clark. 1996. Using Language. Cambridge University Press, Cambridge.
Robin Cooper. 2022. From perception to communication: An analysis of meaning and action using a theory of types with records (TTR). Oxford University Press, Oxford. To appear.
Jonathan Ginzburg. 2012. The Interactive Stance: Meaning for Conversation. Oxford University Press, Oxford.
Eleni Gregoromichelaki, Gregory J. Mills, Christine Howes, Arash Eshghi, Stergios Chatzikyriakidis, Matthew Purver, Ruth Kempson, Ronnie Cann, & Patrick G. T. Healey. 2020. Completability vs (in)completeness. Acta Linguistica Hafniensia.
Eleni Gregoromichelaki. 2017. Quotation in Dialogue. In The Semantics and Pragmatics of Quotation, pages 195–255. Springer.
Patrick G. T. Healey, Gregory J. Mills, Arash Eshghi, & Christine Howes. 2018. Running Repairs: Co-ordinating Meaning in Dialogue. Topics in Cognitive Science, 10(2):367–388.
Oliver Lemon. 2022. Conversational grounding in emergent communication – data and divergence. In Emergent Communication Workshop at ICLR 2022.
Stephen C. Levinson. 2020. On the human “interaction engine”. In N. J. Enfield & S. C. Levinson, editors, Roots of human sociality: Culture, cognition and interaction, pages 39–69. Routledge.
Harvey Sacks, Emanuel A. Schegloff, & Gail Jefferson. 1974. A simplest systematics for the organization of turn-taking for conversation. Language, pages 696–735.
Emanuel A. Schegloff, Gail Jefferson, & Harvey Sacks. 1977. The preference for self-correction in the organization of repair in conversation. Language, 53(2):361–382. 
Word formation on a non-verbal basis: On the integration of icons and sound effects in grammar

Benjamin Lyngfelt, University of Gothenburg, Sweden, benjamin.lyngfelt@svenska.gu.se
Joel Olofsson, University West, Trollhättan, Sweden, joel.olofsson@hv.se

Abstract In this paper we discuss a couple of examples of morphosyntactic integration of non-verbal material in written Swedish, specifically icons and music beats. It is argued that classification in terms of parts-of-speech is not a particularly fruitful approach in these cases. Instead, we suggest a constructional perspective and briefly illustrate how these phenomena may be analyzed in terms of constructions.

1 Introduction Communication can be multimodal in several different ways (see e.g. Björkvall, 2009). In this text we focus on how sound and image can be integrated not only into a text but also morphosyntactically. Verbalising non-verbal sounds is, of course, nothing new; it is the very basis of onomatopoeic expressions. More recent are the possibilities of integrating images into alphabetic writing that have developed in so-called e-communication. Over the last few decades, first emojis and then other icons have taken their place in the written-language repertoire, first as free-standing indications of the function of a message ( , )1 or the writer's state of mind ( , , ), and later also more integrated with the verbally expressed writing (see McCulloch, 2019). This is illustrated in (1), where the icon serves as the first element of a compound; in addition, the example contains the sound-imitating verb expression oontz, oontz, oontzade: (1) "Bang local MILFS" stod det på -bilen som oontz, oontz, oontzade förbi när jag väntade utanför sextonåringens skola. Gulligt ändå. ('"Bang local MILFS", it said on the -car that oontz, oontz, oontzed past as I waited outside the sixteen-year-old's school. Cute, though.') (Example taken from Twitter, 26 September 2022.) In the compound -bilen, the icon serves as the first element modifying the head bilen ('the car') and specifies what kind of car is involved. In this case it is a so-called A-traktor (also known as an EPA-traktor or LGF vehicle), and the icon here refers to the warning sign mounted at the rear of such slow-moving vehicles. On its own, the icon is a general warning symbol; only when placed on a car does it mean 'A-traktor'. The expression -bilen resembles nominal compounds of the type [N-N] (cf. brandbil 'fire engine', elbil 'electric car', glassbil 'ice-cream van'), which could suggest that the icon is interpreted nominally in this context. Does that mean it should be regarded as a noun? Oontz, oontz, oontza is a sound-imitating expression that can be said to function as an onomatopoeic motion verb. Just as the first element of -bilen denotes a characteristic feature of A-traktors, oontz, oontz, oontza refers to repetitively thumping music (cf. the older expression dunka-dunka). The repetitive character is mirrored in the linguistic expression, with (at least) two oontz before the verb form oontza. The latter follows a productive verb-forming derivational pattern that typically takes a noun root. Does that make oontz a noun? In this article we discuss a couple of examples of how at least partly non-verbal expressions are integrated morphosyntactically into written Swedish, and how this can be analysed grammatically. We question parts of speech as the primary analytical tool in this context and instead advocate a construction-based perspective. 1The emoji graphics used in this paper are taken from the Twemoji project, copyright 2020 Twitter Inc and other contributors, licensed under CC BY 4.0, https://creativecommons.org/licenses/by/4.0/ Benjamin Lyngfelt and Joel Olofsson. 2022. Ordbildning på icke-verbal grund. Om integrering av ikoner och ljudeffekter i grammatiken. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 83–87. Available under CC BY 4.0

2 Icons and parts of speech -bilen is a nominal compound, i.e. it forms a noun as a whole.2 The second element, which is the compound's semantic and syntactic head, is also a noun. The first element, by contrast, is subject to no particular part-of-speech restrictions. Nouns are the most common here too, [N-N], but the first element can also be e.g. an adjective (småbil 'small car'), a verb (hyrbil 'rental car') or, as we have seen, an icon ( -bil). Even with lexical first elements the part-of-speech affiliation is not always obvious; is lastbil 'lorry' based on the noun last 'load' or the verb lasta 'to load', and lyxbil 'luxury car' on the noun lyx 'luxury' or the adjective lyxig 'luxurious'? Since -bilen, as noted, resembles [N-N] compounds and the icon denotes an object, it may seem natural to construe the icon as a noun. At the same time it also functions as a warning signal and in that respect rather resembles an interjection. This is not an exceptional property of this particular icon; icons in general come across as notoriously multifunctional from a grammatical point of view. For instance,  can be interpreted both as a noun (flygplan 'airplane') and as a verb (flyga/resa 'fly/travel'), or even as a whole utterance (jag flyger till Indien 'I am flying to India'). Another example is , which can be interpreted as a noun (fest 'party'), a verb (festa 'to party'), an adjective (glad/partysugen 'happy/in a party mood') and/or an interjection (Hurra! 'Hooray!'). Following the traditional classification principle "different part of speech – different word" would thus lead to massive homonymy among icons. The question is whether part-of-speech assignment is at all meaningful for icons. Presumably, the part of speech of a lexical unit amounts to a conventionalised set of morphosyntactic properties, but ordinary part-of-speech criteria are of little help here (Adesam & Bouma, 2019). Icons lack morphosyntactic marking altogether, respond poorly to distributional tests, tend to be multifunctional, and the conventionalisation of their grammatical behaviour has usually not progressed very far (although certain traits may have become conventionalised; cf. McCulloch, 2019). 
According to the Koala tagging model used in Språkbanken Text, they are simply classed as symbols (Adesam & Bouma, 2019). Nor does the sound effect oontz in the expression oontz, oontz, oontzade have any clear part-of-speech affiliation. As a free-standing sound, the expression most resembles an interjection, like sound-imitating interjections such as bom, klang and smack. At the same time it can in some respects behave like a noun: ett/flera oontz ('an/several oontz'). And in example (1) it is part of a verb. In Koala, expressions of this kind are treated as foreign material, without further specification (Adesam & Bouma, 2019). In short, how icons and sound-imitating expressions are interpreted – like other linguistic units, for that matter – depends to a large extent on context. Parts of speech, otherwise highly useful for several types of linguistic analysis (cf. e.g. Adesam & Bouma, 2019; Kalm, 2021), are perhaps not the most fruitful approach in this particular setting.

3 Derivations Verb derivation is exceedingly productive in Swedish; we can form verbs by combining a verbal ending with more or less anything that can be construed as some kind of activity. Well-known examples are vabb+a ('stay home with a sick child'), facebook+a ('use Facebook') and sol-och-vår+a ('romance-swindle someone'), whereas oontz, oontz, oontz+ade in the example above is a less established formation. In traditional grammar, derivation by suffix is assumed to entail a change of part of speech, but, as with icons, it is doubtful whether oontz, oontz, oontz can be said to have any given part-of-speech affiliation. The part-of-speech problem becomes even more acute if we go one step further and form nouns (ett evinnerligt oontz, oontz, oontzande 'an everlasting oontz, oontz, oontzing') or participles/adjectives (ett oontz, oontz, oontzande dansgolv 'an oontz, oontz, oontzing dance floor'). Both so-called verbal nouns and participles are typically formed from verbs, but such an analysis would here imply an implausible two-step process in which one first forms the verb oontz, oontz, oontza and then derives this verb further into a noun or participle/adjective. 
For the nominal use, the Latin term nomen actionis is thus a better fit than the Swedish term verbalsubstantiv – a 'noun expressing an action' rather than a 'noun formed from a verb'. Other examples of nomina actionis without a verb stem are hemma hos-ande and GI-ande (see Holmer, 2022). Returning to the verb, oontz, oontz, oontzade can be said to function as a motion verb in the example; at any rate, it is part of the motion expression oontz, oontz, oontzade förbi and there fills the role typically expressed by a motion verb (cf. körde förbi 'drove past'). Using sound-imitating verbs in motion expressions in fact seems to be a fairly productive pattern; cf. susa förbi ('whoosh past'), skramla iväg ('rattle off') and braka in i ngt ('crash into sth') (Olofsson, 2018). Over time the association with motion can become lexicalised, as in the case of susa, but it would be surprising if oontz, oontz, oontza ever got that far. A sign of such a development would otherwise be the expression being reduced to simply oontza.

4 Discussion: word formation through constructions How, then, should we analyse oontz, oontz, oontzade and -bilen? Postulating (ad hoc) lexical units with associated part-of-speech properties and the like appears both cumbersome and implausible. Instead, we propose that the nominal and verbal associations, respectively, arise in context (so-called emergence), as assumed in, among other frameworks, construction grammar. Construction grammar is probably best known for its treatment of various types of multi-word patterns, but it has also been successfully applied to morphology (e.g. Booij, 2010). The new formations in (1) can thus be handled as instances of a compounding construction and a verb construction, respectively. 2Compounds in Swedish are treated in detail by, among others, Svanlund (2009) and Loenheim (2019); see also Teleman et al. (1999). 
4.1 Verbs and compounds as constructions A construction consists of one or more construction elements, each of which fills a certain function in the construction and is associated with a (more or less specified) set of properties. Starting with the verb construction, it can be said to contain two elements: a STEM and GRAMMATICAL MARKING (GM), chiefly of tense. In oontz, oontz, oontzade, [oontz, oontz, oontz] functions as the stem, while [-ade] constitutes the GM and marks both verb category and past tense. Normally, of course, the stem is an established verb, already associated with verbal properties. Such verbs constitute specific types of the construction and have lexicalised the GM features. In the case of oontz, oontz, oontzade, however, the stem is not itself a verb, but is construed as an activity precisely by entering into a verb construction. The difference between this view and more strictly compositional models is thus that the properties of the construction elements need not be lexically given by the constituent parts, but can also follow from the construction they enter into. The motion meaning, too, is assumed to follow from a construction, more precisely a syntactic construction with the elements VERB and DIRECTIONAL ADVERBIAL: oontz, oontz, oontzade förbi. As noted, this pattern most often appears with established motion verbs, but verbs that do not have this meaning inherently can thus acquire it in precisely this construction. The decisive condition is that the verb expresses a meaning that can be associated with motion, in the present case the sound heard from the passing A-traktor. Motion constructions of this kind are treated in detail in Olofsson (2018). We can handle -bilen in a similar way. The expression is assumed to instantiate a nominal compounding construction with the elements FIRST ELEMENT, SECOND ELEMENT and GM. Here the GM expresses nominal properties such as definiteness and attaches to the second element. 
Since nominal compounds are consistently determinative, the first element must be interpretable as a specification of the second. Beyond that, the construction places no particular restrictions on the part of speech or other properties of the first element, other than that it must, purely practically, be combinable with the second element.3 If -bilen is also to be understood more specifically as a compound of the type [N-N], it is additionally required that the icon be interpreted nominally. There is no room here for a full treatment of the notion of 'nominal function' (cf. Teleman et al., 1999), but, roughly simplified, it can be taken to mean that -bilen is interpreted on a par with firmabilen ('the company car') and leksaksbilen ('the toy car') rather than with the (more verbal) hyrbilen ('the rental car'). Even this, however, does not require that the icon be analysed as a noun, either grammatically or as a lexical unit.

4.2 Unification and so-called coercion The starting point of the analysis is that the constructions constitute patterns which, in language use, are instantiated by concrete linguistic expressions. Technically, the construction elements and the expressions realising them are united through so-called unification (Fillmore & Kay, 1993; Fried & Östman, 2004). The elements and their instantiations need not have identical properties; it suffices that they are compatible. In the vast majority of cases this is fairly unproblematic. Things get more interesting when the properties of elements and instantiations do not fully match. This is the case, for instance, when verbs appear in syntactic structures that clash with the verb's valency, as in the example Kan man äta bort sin huvudvärk? ('Can you eat your headache away?'). The object of äta ('eat') normally expresses what is eaten, which here is hardly the headache. 3More specific demands are placed on the second element, which constitutes both the syntactic and the semantic head and must be combinable with any GM. The second element thus needs to function as a nominal both formally and functionally. This is why -semestern works better than semester- -en; the uninflected variant semester- is, however, conceivable. 
Instead, the object is associated with bort ('away') – that which is to go away – rather than interpreted as the target of the verbal action. This interpretation arises precisely in the construction [verb + bort + object]; cf. ?äta sin huvudvärk ('?eat one's headache') (Sjögreen d.y., 2015). Phenomena of this kind, where linguistic expressions are used in ways that clash with their conventionalised properties, are usually described in terms of coercion. From a constructional perspective, coercion means that a construction "overrides" lexical features of the constituent parts, which are thereby adapted to the conditions on the construction elements they are unified with (e.g. Michaelis, 2005). It can thus be regarded as coercion to insert an icon as the first element of a compound even though the icon in itself has no such morphological properties. The term coercion may suggest that we are committing some form of persuasive violence on language, but the phenomenon is really just a special case of linguistic expressions being adapted to their context – here in the form of the linguistic construction the expression enters into. Constructional coercion is, however, a powerful tool which, without restrictions, would open the door to the wildest overgeneration. So why do we not constantly produce a flood of convention-breaking innovations? The most important and most general restriction is probably the force of habit. We language users are not as creative as we may like to think, at least not all the time; it is usually smoothest for both sender and receiver to use common, easily accessible and expected modes of expression. A more specific factor, which has been proposed as an explanation of capricious restrictions on partial productivity, is competition: an expression may be dispreferred because it is crowded out by more established alternatives (Goldberg, 2019). Finally, we should not forget that language use is not governed solely by restrictions. 
It also takes positive forces, such as motivation and contextual relevance – a reason to use an expression, and perhaps something that favours the choice of precisely that mode of expression. One context favouring the use of icons in compounds could, for instance, be a -skrift (cf. festskrift 'Festschrift') for Lars Borin.

References

Yvonne Adesam & Gerlof Bouma. 2019. The Koala part-of-speech tagset. Northern European Journal of Language Technology, 6(2):5–41.
Anders Björkvall. 2009. Den visuella texten. Multimodal analys i praktiken, volume 40 of Ord och stil. Språkvårdssamfundets skrifter. Hallgren & Fallgren, Stockholm.
Geert Booij. 2010. Construction Morphology. Oxford University Press, Oxford.
Charles J. Fillmore & Paul Kay. 1993. Construction grammar coursebook. Unpublished manuscript. Department of Linguistics, University of California, Berkeley.
Mirjam Fried & Jan-Ola Östman. 2004. Construction grammar: A thumbnail sketch. In Mirjam Fried & Jan-Ola Östman, editors, Construction Grammar in a Cross-Language Perspective, volume 2 of Constructional Approaches to Language, pages 11–86. John Benjamins, Amsterdam.
Adele E. Goldberg. 2019. Explain me this: creativity, competition, and the partial productivity of constructions. Princeton University Press, New Jersey.
Louise Holmer. 2022. Neutrala substantiv på -ande i text och ordbok. Ph.D. thesis, Göteborgs universitet.
Mikael Kalm. 2021. Ordklasser – finns de? In Johan Brandtler & Mikael Kalm, editors, Nyanser av grammatik: gränser, mångfald, fördjupning, pages 23–42. Studentlitteratur, Lund.
Lisa Loenheim. 2019. Att tolka det sammansatta. Befästning och mönster i första- och andraspråkstalares tolkning av sammansättning. Ph.D. thesis, Meijerbergs Arkiv för Svensk Ordforskning, Göteborg.
Gretchen McCulloch. 2019. Because internet: understanding the new rules of language. Riverhead Books, New York.
Laura Michaelis. 2005. Entity and event coercion in a symbolic theory of syntax. 
In Jan-Ola Östman & Mirjam Fried, editors, Construction Grammar(s): Cognitive Grounding and Theoretical Extensions, pages 45–87. John Benjamins, Amsterdam.
Joel Olofsson. 2018. Förflyttning på svenska. Om syntaktisk produktivitet utifrån ett konstruktionsperspektiv. Ph.D. thesis, Göteborgs universitet, Göteborg.
Christian Sjögreen d.y. 2015. Kasta bort bollen och äta bort sin huvudvärk. En studie av argumentstrukturen i kausativa bort-konstruktioner. Ph.D. thesis, Uppsala universitet, Uppsala.
Jan Svanlund. 2009. Lexikal etablering. En korpusundersökning av hur nya sammansättningar konventionaliseras och får sin betydelse. Acta Universitatis Stockholmiensis, Stockholm.
Ulf Teleman, Staffan Hellberg, & Erik Andersson. 1999. Svenska Akademiens grammatik. Svenska Akademien, Stockholm.

Building a multilingual AWE tool for L2 learners: Challenges and ideas

Arianna Masciolini, Språkbanken Text, University of Gothenburg, Sweden, arianna.masciolini@gu.se

Abstract Language tools are a valuable ICALL resource, especially for learners of a second language. In this work, we discuss the challenges involved in the development of a multilingual Automatic Writing Evaluation tool addressing their specific needs and brainstorm some potential solutions.

1 Introduction It is common to identify ICALL (Intelligent Computer Assisted Language Learning) applications with ILTSs (Intelligent Language Tutoring Systems). ICALL, however, is a broader category, which also includes various types of language tools (Heift & Vyatkina, 2017), such as online dictionaries, MT (Machine Translation) software, spell checkers and morphological analyzers. 
Usually designed not specifically with learners in mind but rather for the general population, these tools are especially important, perhaps more so than ILTSs, for intermediate and advanced L2 (Second Language) learners, who, as opposed to students learning a FL (Foreign Language), tend to acquire language in its context of use, often without receiving any formal instruction (Kramsch, 2000). In this category, AWE (Automatic Writing Evaluation) tools have recently started emerging, Grammarly being perhaps the most widespread example.1 Based on an automatic analysis of the learner's text, AWE software can provide different kinds of feedback, from numerical scores to corrections and stylistic suggestions (Hockly, 2018). Some AWE tools targeting teachers are used in standardized tests, while tools like Grammarly, used both by natives and learners, are increasingly popular outside the classroom. With such a large target group, however, they are not always able to address the specific needs of learners in the best way possible. First, the majority of AWE tools targeting the writer (rather than the grader) provide corrections but no explanations. Furthermore, the analyses these systems perform are not specifically focused on L2 learner errors, which may well differ from those typical of L1 users of the same language. In addition, most AWE tools are monolingual, and adapting them to other languages, when at all possible, requires a large amount of data, which is often unavailable. We are interested in building an AWE tool that addresses these issues. We focus primarily on grammaticality, aiming for a system able to provide both corrections and verbal explanations in a potentially wide variety of languages, targeting learners with different levels of proficiency and metalinguistic awareness. 
We propose an approach where, after obtaining a correction hypothesis for the learner's input, both the original text and its correction are processed with a UD (Universal Dependencies) parser (de Marneffe et al., 2021). The two resulting treebanks are then compared by an error analysis module, which outputs a lossless structured representation of the errors it detects based on the discrepancies between them. Finally, these structured data are converted into human-readable feedback through a domain-specific CNL (Controlled Natural Language). In the following, we go through the various steps one by one, discussing potential solutions to the problems that the implementation of each of them raises. Section 2 is dedicated to the question of how to obtain a good correction hypothesis. Section 3 discusses the challenging aspects of parsing, while Section 4 focuses on how to use the results of this step for error analysis. After that, Section 5 revolves around generating metalinguistic feedback from structured data. We then close with some concluding remarks. 1grammarly.com Arianna Masciolini. 2022. Building a multilingual AWE tool for L2 learners: Challenges and ideas. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 89–93. Available under CC BY 4.0

2 Obtaining a correction hypothesis As mentioned in the Introduction, our system provides feedback based on a comparison between the user input and a corrected version of the same text. The first processing step is therefore that of obtaining a correction hypothesis. We use this expression to emphasize the fact that every correction is based on some interpretation of what the writer meant to express through their text. As an example, consider the ungrammatical sentence "*This are my *contribute to the Festschrift in honor of Lars Borin". 
A possible correction is "This is my contribution to the Festschrift in honor of Lars Borin", but the author might instead have meant to say that they made several contributions, the proper correction thus becoming "These are my contributions to the Festschrift in honor of Lars Borin". For this reason, obtaining a correction hypothesis with a GEC (Grammatical Error Correction) tool is, while certainly an option worth considering, not necessarily optimal: the results obtained with such a tool are not guaranteed to match the learner's intentions and can therefore be confusing. Aside from this, the amount of GEC software available is still quite limited and performance is uneven across languages. When it comes to Swedish, for example, both older software such as the hybrid rule-based/probabilistic tool Granska (Domeij et al., 2000) and the most recent neural approaches (Nyberg, 2022) still present some weaknesses, especially when it comes to longer sentences containing several and/or multi-token errors. In an interactive system, an alternative solution exploiting the learner's L1 competence is to use MT. This approach would consist in translating the user input into the learner's L1 (or any other language they have selected as the instruction language) and letting them adjust the result to clarify their intentions. The resulting L1 text would then be translated back into the L2, producing a correction hypothesis that hopefully matches the learner's expectations. Among several translation candidates, the one closest to the original user attempt could be selected based on a metric such as the BLEU score (Papineni et al., 2002). 
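The candidate-selection step just described can be sketched in a few lines. This is a minimal illustration under assumptions: all names are ours, and the clipped n-gram count used here is only a crude stand-in for a proper BLEU implementation (no brevity penalty, no geometric mean).

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def similarity(candidate, reference, max_n=2):
    """Clipped uni- and bigram matches: a crude stand-in for BLEU."""
    c, r = candidate.lower().split(), reference.lower().split()
    return sum(
        sum(min(count, ngrams(r, n)[g]) for g, count in ngrams(c, n).items())
        for n in range(1, max_n + 1)
    )

def closest_candidate(attempt, candidates):
    """Pick the back-translated candidate closest to the learner's attempt."""
    return max(candidates, key=lambda s: similarity(s, attempt))

attempt = "This are my contribute to the Festschrift in honor of Lars Borin"
candidates = [
    "This is my contribution to the Festschrift in honor of Lars Borin",
    "These are my contributions to the Festschrift in honor of Lars Borin",
]
# The second candidate preserves the bigram "are my" and so scores higher.
best = closest_candidate(attempt, candidates)
```

Under this toy metric the plural reading wins because it keeps more of the learner's surface form; a real system would of course use a proper MT evaluation metric and let the user confirm the choice.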
The learner's L1 being Swedish, for instance, our example sentence could be automatically translated to "Detta är mitt bidrag till Festschrift till Lars Borins ära", but the user would then be given a chance to intervene, specifying that they meant "Detta är mina bidrag till Festschrift till Lars Borins ära." Two advantages of back-and-forth translation are its awareness of the learner's intentions and its high multilinguality, as MT tools are nowadays available for a vast number of language pairs. Translation errors can of course pose problems, especially in the L1-to-L2 direction, where users cannot intervene, but we expect this issue to be mitigated by the tendency of learner language to be relatively simple.

3 Parsing For the morphosyntactic analysis of learner text, we propose using a UD parser, i.e. a dependency parser outputting CoNLL-U files following the Universal Dependencies guidelines (de Marneffe et al., 2021). As the name suggests, UD is a framework for cross-linguistically consistent grammatical annotation: the scheme is largely identical across languages, and even language-specific features are annotated following shared principles. This significantly simplifies the task of working with several L2s, which is one of our main ambitions. Furthermore, state-of-the-art UD parsers such as UDPipe (Straka, 2018) are remarkably accurate, fast, open source and easy to use. While processing correction hypotheses should therefore be unproblematic, learner language poses significant challenges. In their systematic study of the performance of dependency parsers on learner English, Huang et al. (2018) have shown that, while their overall accuracy stays reasonably high even for L2 text, they are not robust to grammatical errors. 
The overall good scores seem in fact to be due to errors being sparse and learner sentences being shorter and simpler than those written by native or otherwise highly proficient users of the same language, not to mention that the study only takes dependency labels and Part Of Speech (POS) tags into account, thus giving no information about the accuracy of annotation when it comes to, for instance, incorrectly inflected items. There have been some efforts to develop parsers specifically meant for learner language. Sakaguchi et al. (2017), for instance, have proposed an error-repairing architecture capable of dealing with a variety of single-token errors. Another, perhaps more straightforward approach could be training an existing UD parser on manually or semi-automatically annotated learner data. As we will see in Section 4, some UD-annotated parallel learner treebanks are in fact available, but both the number of languages for which these resources already exist and their sizes are most likely insufficient. We suggest that, in our context of application, the parsing of learner sentences could be informed by that of the corresponding correction hypotheses. In practice, the annotation of learner attempts could consist in postprocessing the corresponding corrected sentences annotated with a standard UD parser.

4 Analyzing errors We propose framing error analysis, which lies at the heart of our hypothetical AWE tool, as a tree comparison task. More specifically, this means operating on a parallel learner treebank (or, to use a term coined in Lee et al. (2017b), an L1-L2 treebank), i.e. a dependency corpus where learner sentences are aligned with the corresponding correction hypotheses. This format was originally designed to address the interoperability issues arising from the coexistence of the different markup styles and tagsets used for annotating learner corpora, usually employed to retrieve occurrences of particular error patterns. 
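As a concrete substrate for this kind of tree comparison, a minimal reader for a single CoNLL-U sentence block might look as follows. This is a deliberate simplification: comment lines, multiword-token ranges and empty nodes are skipped, the XPOS/DEPS/MISC columns are dropped, and the column layout assumed is the standard ten-field CoNLL-U format.

```python
def read_conllu(block: str) -> list[dict]:
    """Parse one CoNLL-U sentence into a list of token dicts (simplified)."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment lines such as "# text = ..."
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges ("3-4") and empty nodes ("5.1")
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "upos": cols[3], "feats": cols[5],
            "head": int(cols[6]), "deprel": cols[7],
        })
    return tokens

example = (
    "# text = This works\n"
    "1\tThis\tthis\tPRON\tDT\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tworks\twork\tVERB\tVBZ\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\t_\n"
)
tokens = read_conllu(example)
```

A real implementation would of course use an existing CoNLL-U library rather than this sketch, but it shows which fields (FORM, UPOS, FEATS, DEPREL) the comparison below needs access to.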
The idea is to, rather than defining a universal error taxonomy, simply annotate both learner sentences and correction hypotheses according to the UD guidelines, and then retrieve errors via tree queries. In our case, parallel learner treebanks, so far usually handcrafted (Berzak et al., 2016; Lee et al., 2017a; Di Nuovo et al., 2019), are to be obtained via automatic parsing (see Section 3). With an approach similar to what we have in mind, parallel learner treebanks have been used to derive error taxonomies dynamically (Choshen et al., 2020). The method is conceptually simple: given a portion of a learner sentence containing a grammatical error and the corresponding correction, errors are found by selecting the parsed learner substring node closest to the root and checking whether its counterpart in the correction has the same UD label and POS tag. If that is not the case, an error has been found. Its class is defined as the ordered pair of diverging UD labels or POS tags, token additions and deletions being a special case where the pair lacks one of its elements. Of course, in this way errors that do not involve a UD label or POS tag change, such as incorrectly inflected words, are left out. To address this issue, the FEATS field of CoNLL-U files, reserved for morphological analysis, is also taken into account, and errors of this kind are labeled with the morphological feature(s) they retain. Assuming that "This is my contribution to the Festschrift in honor of Lars Borin" is a suitable correction of "*This are my *contribute to the Festschrift in honor of Lars Borin", then, the first (inflection) error would be labeled Mood=Ind|Person=3|Tense=Pres|VerbForm=Fin (dropping "are"'s Number=Plur), while the second's category would be VERB→NOUN. While Choshen et al. (2020)'s work is only concerned with classification, albeit fine-grained and dynamic, our error analysis module is intended to output lossless representations of each error. 
As we will discuss in Section 5, these are to be converted into human-readable feedback of arbitrary granularity only at a later stage, by a separate program, which implies machine-readability as another requirement for our data format. In addition, the simple error labeling algorithm we described, implemented as an open source program which we unsuccessfully tried to run, seems to discard too much of the information available in CoNLL-U files, and we do not see ways for it to work effectively on anything but single-token errors. The simplest solution could be representing errors as pairs of UD subtrees representing a fragment of the learner's sentence and its correction, aligned with a variant of the approach proposed by Masciolini & Ranta (2021). However, this would mean carrying additional information, such as the specific word forms used and the features that remain the same even after correction has taken place. Not only is this superfluous for the feedback step: it also prevents us from using our data format for the task, complementary to feedback generation, of error retrieval, which was Lee et al. (2017b)'s original goal with L1-L2 treebanks. A good basis for defining a better error format could be hst, the Haskell-embedded DSL (Domain Specific Language) used for pattern matching UD trees in gf-ud (Kolachina & Ranta, 2016; Ranta & Kolachina, 2017).2 2The pattern matching language documentation can be found at github.com/GrammaticalFramework/gf-ud/blob/master/doc/patterns.md

5 Generating feedback As mentioned in the introduction, automatic feedback comes in different forms. Since our hypothetical AWE tool has the learners themselves as its intended users, we are interested in generating feedback that they can understand and make use of to independently improve their texts and acquire new grammatical knowledge. 
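The kind of feedback generation we have in mind, one shared error representation with several renderings differing in language and granularity, can be illustrated with a toy Python analogue. This is purely illustrative: all names and templates are ours, and a real implementation would use a proper CNL rather than string templates.

```python
# Tiny lexicons mapping UPOS tags to everyday grammatical terms
# (only the two tags needed for the running example).
POS_EN = {"VERB": "verb", "NOUN": "noun"}
POS_SV = {"VERB": "verb", "NOUN": "substantiv"}

# One abstract error representation, several concrete renderings:
# each (language, granularity) pair gets its own "linearizer".
LINEARIZERS = {
    ("en", "label"): lambda e: f"{e['learner_upos']}→{e['corrected_upos']}",
    ("en", "full"): lambda e: (
        f"'{e['learner_form']}' is a {POS_EN[e['learner_upos']]}, but a "
        f"{POS_EN[e['corrected_upos']]} such as '{e['corrected_form']}' "
        f"is expected here."
    ),
    ("sv", "full"): lambda e: (
        f"'{e['learner_form']}' är ett {POS_SV[e['learner_upos']]}, men här "
        f"väntas ett {POS_SV[e['corrected_upos']]}, t.ex. '{e['corrected_form']}'."
    ),
}

def feedback(error: dict, lang: str = "en", granularity: str = "full") -> str:
    """Render one structured error as feedback in the requested form."""
    return LINEARIZERS[(lang, granularity)](error)

error = {"learner_form": "contribute", "learner_upos": "VERB",
         "corrected_form": "contribution", "corrected_upos": "NOUN"}
```

The point of the toy is only the architecture: the error dict plays the role of an abstract syntax tree, and each entry in the table plays the role of a concrete syntax, which is exactly the division of labour a grammar formalism like GF provides in a principled way.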
In her study on the impact of corrective feedback on learner uptake, Heift (2004) distinguishes three approaches commonly used in ICALL systems: recasting, which implies showing correction hypotheses as replacement suggestions; highlighting, which consists in showing only the location of the error(s); and providing metalinguistic feedback, i.e. giving verbal explanations of what causes a sentence to be incorrect. While the first two do not require all of the processing steps we have so far described, we focus on the third, which has proven to be an especially effective way to incentivize learners to reflect on their own errors and correct them (Heift, 2004). In our setting, and in particular after the error analysis step, feedback generation can be seen as a data-to-text conversion task. Crucially in a multilingual setting, metalinguistic feedback should be available in several languages. Moreover, there should be a possibility to adjust the feedback to the learner’s level of proficiency and/or metalinguistic awareness and, ideally, also to output labels belonging to some error taxonomy rather than extended explanations. For these reasons, we think a CNL implemented in GF (Grammatical Framework), a well-established programming language for multilingual grammar engineering, would be suitable for the job. In GF, grammars are composed of an abstract syntax, playing the role of an interlingua, and one or more concrete syntaxes, capturing the specificities of the various languages. Translating implies parsing a string in the source language to an AST (Abstract Syntax Tree) and then linearizing it to a new string in the target language. Designed for building multilingual applications, GF makes it relatively easy to develop semantic or application grammars, i.e.
domain-specific CNLs, by re-using rules defined by the large-scale syntactic grammars of over 40 languages which constitute GF’s “standard library”, usually referred to as the RGL (Resource Grammar Library) (Ranta, 2011). Using GF, our task can become that of defining an application grammar that has the error descriptions output by the error analysis module as one of its concrete syntaxes. To this, we can add an arbitrary number of additional concrete syntaxes for verbal feedback in different languages and at different levels of granularity, ranging from labels to exhaustive explanations. In terms of grammar engineering, this is not a trivial task, as the definition of the abstract syntax resembles that of a novel, albeit more flexible, error taxonomy, derived in this case from the actual errors, possibly incrementally, and not defined a priori. An alternative path yet to be explored is trying to exploit the interoperability between UD and GF (Kolachina & Ranta, 2016; Ranta & Kolachina, 2017): if errors are represented as (simplified) UD subtrees, they could be automatically converted to GF ASTs, thus making the definition of a new abstract syntax unnecessary.

6 Concluding remarks

In this work, we have expressed our interest in building a multilingual AWE tool for L2 learners. We discussed some of the challenges associated with the task and brainstormed potential solutions. We hope that the open problems we presented, which will be the object of our future work, have sparked the reader’s interest in the topic.

References

Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, & Boris Katz. 2016. Universal Dependencies for learner English. arXiv preprint arXiv:1605.04278.

Leshem Choshen, Dmitry Nikolaev, Yevgeni Berzak, & Omri Abend. 2020. Classifying syntactic errors in learner language. arXiv preprint arXiv:2010.11032.

Marie-Catherine de Marneffe, Christopher D.
Manning, Joakim Nivre, & Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.

Elisa Di Nuovo, Cristina Bosco, Alessandro Mazzei, & Manuela Sanguinetti. 2019. Towards an Italian learner treebank in Universal Dependencies. In 6th Italian Conference on Computational Linguistics, CLiC-it 2019, volume 2481, pages 1–6. CEUR-WS.

Rickard Domeij, Ola Knutsson, Johan Carlberger, & Viggo Kann. 2000. Granska – an efficient hybrid system for Swedish grammar checking. In Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999), pages 49–56.

Trude Heift & Nina Vyatkina. 2017. Technologies for teaching and learning L2 grammar. In The handbook of technology and second language teaching and learning, pages 26–44.

Trude Heift. 2004. Corrective feedback and learner uptake in CALL. ReCALL, 16(2):416–431.

Nicky Hockly. 2018. Automated writing evaluation. ELT Journal, 73(1):82–88.

Yan Huang, Akira Murakami, Theodora Alexopoulou, & Anna Korhonen. 2018. Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1):28–54.

Prasanth Kolachina & Aarne Ranta. 2016. From abstract syntax to universal dependencies. In Linguistic Issues in Language Technology, Volume 13, 2016.

Claire Kramsch. 2000. Second language acquisition, applied linguistics, and the teaching of foreign languages. The Modern Language Journal, 84(3):311–326.

John SY Lee, Herman Leung, & Keying Li. 2017a. Towards Universal Dependencies for learner Chinese. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 67–71.

John SY Lee, Keying Li, & Herman Leung. 2017b. L1-L2 parallel dependency treebank as learner corpus. In Proceedings of the 15th International Conference on Parsing Technologies, pages 44–49.

Arianna Masciolini & Aarne Ranta. 2021. Grammar-based concept alignment for domain-specific machine translation.
In Proceedings of the Seventh International Workshop on Controlled Natural Language (CNL 2020/21).

Martina Nyberg. 2022. Grammatical error correction for learners of Swedish as a second language. Master’s thesis, Uppsala Universitet.

Kishore Papineni, Salim Roukos, Todd Ward, & Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Aarne Ranta & Prasanth Kolachina. 2017. From universal dependencies to abstract syntax. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 107–116.

Aarne Ranta. 2011. Grammatical Framework: Programming with Multilingual Grammars, volume 173. CSLI Publications, Center for the Study of Language and Information, Stanford.

Keisuke Sakaguchi, Matt Post, & Benjamin Van Durme. 2017. Error-repair dependency parsing for ungrammatical texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 189–195.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels. Association for Computational Linguistics.

Flexible Universal Dependencies using nested dependency graphs

Joakim Nivre
Department of Computer Science, RISE Research Institutes of Sweden, joakim.nivre@ri.se
Department of Linguistics and Philology, Uppsala University, Sweden, joakim.nivre@lingfil.uu.se

Abstract

Universal Dependencies (UD) is a framework for cross-linguistically consistent morphosyntactic annotation, which has to date been applied to 130 languages. Despite widespread adoption of UD, there are a number of issues in the annotation guidelines that continue to raise debate, such as the treatment of function words and the criteria for word segmentation.
In this article, I sketch an extension of the UD framework, which may allow these and other issues to be resolved by offering more flexibility in the analysis of certain linguistic phenomena.

1 Introduction

Universal Dependencies (UD) is a project that develops cross-linguistically consistent morphosyntactic annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective (Nivre et al., 2016; Nivre et al., 2020; de Marneffe et al., 2021). Since its start in 2014, the project has grown into a large community effort with contributions from 503 researchers around the world, and the latest release (v2.10) features 228 annotated corpora representing 130 languages. In addition to their use in natural language processing research, these resources are also increasingly being used for empirical studies in linguistic typology and evolutionary linguistics, which indicates that there was a real need in the community for a cross-linguistic standard for morphosyntactic annotation. Nevertheless, questions have been raised about the appropriateness of certain design choices in UD, and alternatives have been proposed in the literature. Most of these alternatives, however, are largely compatible with the overall approach of UD and differ only in the analysis of certain linguistic phenomena. This is true, for example, of the most well-known alternative framework, Surface-Syntactic Universal Dependencies (SUD) (Gerdes et al., 2018). This raises the question of whether it is possible to extend UD into a framework that subsumes UD as well as a number of its proposed variants — a framework that we might call Flexible Universal Dependencies (FUD). In this short paper, I want to sketch one approach for realizing such a framework.
The basic idea is to extend UD representations from simple dependency trees to nested dependency graphs, where nodes are not limited to atomic syntactic units corresponding to syntactic words but can themselves be dependency graphs that provide compact representations of alternative structural analyses. This proposal draws on a number of previous proposals, including the syntactic nuclei of Tesnière (1959), the bubble trees of Kahane (1997), and the underspecified dependency representations of Schneider et al. (2013). On a more abstract level, it can be seen as an elaboration of the notion of theory-supporting treebanks that I proposed some twenty years ago (Nivre, 2003).

Joakim Nivre. 2022. Flexible Universal Dependencies using nested dependency graphs. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 95–100. Available under CC BY 4.0

Figure 1: Basic UD representation with limited morphological annotation. [Dependency tree with part-of-speech tags for the sentence “they may have killed two birds with one stone”.]

Figure 2: Japanese word segmentation schemes. Illustration from Murawaki (2019).

2 Annotation principles and issues

The linguistic theory underlying UD is based on two fundamental ideas: (a) the basic syntactic units are words; and (b) syntactic structure consists of grammatical relations between words (de Marneffe et al., 2021). This leads naturally to an annotation scheme where sentences are segmented into words, which are annotated with morphological information in the form of lemmas, part-of-speech tags and morphological features, and where the syntactic annotation takes the form of a dependency tree where the nodes represent words and the arcs represent grammatical relations.
This kind of annotation is illustrated in Figure 1.1 The choice of words as basic syntactic units is motivated by the lexical integrity principle (Chomsky, 1970; Bresnan & Mchombo, 1995; Aronoff, 2007), which states that words are built out of different structural elements and by different principles of composition than syntactic constructions, and by the belief that a word-based model will generalize better across languages than trying to segment words into smaller units like morphemes. However, it is well known that cross-linguistically valid criteria for word segmentation are hard to establish (Haspelmath, 2011), and that the application of such criteria is especially challenging for languages which do not have a tradition of marking word boundaries in the orthography. Japanese, for example, has at least three established standards for segmentation into word-like units — known as short unit words (SUW), long unit words (LUW) and bunsetsus — and the choice of an appropriate standard for UD annotation has been the subject of considerable discussion (Tanaka et al., 2016; Asahara et al., 2018; Murawaki, 2019; Han et al., 2020; Omura et al., 2021). As a result, some UD treebanks for Japanese now exist in several versions with different word segmentation schemes. Figure 2 illustrates the three established standards as well as a fourth one proposed by Murawaki (2019) as a better fit to the UD notion of syntactic word. The dependency analysis adopted in UD gives priority to relations holding between the lexical heads of predicates, arguments and modifiers in order to maximize parallelism across languages with different structural characteristics. Major syntactic relations therefore typically hold directly between content words, while function words are treated as grammatical markers on content words.
This is illustrated in Figure 1, where the nominal subject relation (nsubj) holds between the main verb killed and the subject pronoun they, while the auxiliary verbs may and have are treated as dependents of the main verb. Similarly, the oblique modifier relation (obl) holds between killed and the noun stone, while the numeral one and the preposition with are both dependents of stone.

1 Since the morphological annotation is not relevant for the discussion in this paper, I have limited it to part-of-speech tags in Figure 1 and will suppress it completely later.

Figure 3: Basic SUD representation corresponding to the UD representation in Figure 1. [Dependency tree over the same sentence, with relations subj, comp:aux, comp:obj and mod.]

The primary motivation for giving priority to relations between content words is that they are more likely to be parallel across languages, while function words in one language may correspond to morphological inflection or nothing at all in other languages. Regardless of the motivation, however, the treatment of function words has turned out to be one of the most controversial aspects of UD, as it has been perceived as incompatible with syntactic theories that treat function words as syntactic heads (Gerdes & Kahane, 2016; Osborne & Gerdes, 2019). This has led to the development of a sister framework to UD known as Surface-Syntactic Universal Dependencies (SUD) (Gerdes et al., 2018), which is described by its creators as near-isomorphic to UD but which differs in particular by treating function words as heads in the dependency structure, as illustrated in Figure 3.2 It is important to note that, despite the obvious differences, the UD and SUD representations also have several things in common. Disregarding for the moment the use of different labels for some relations in SUD and UD (subj vs. nsubj, comp:obj vs. obj, and mod vs.
obl and nummod), the two representations give the same analysis of the direct object construction and the numerals, and also posit a subject relation from the verb group may have killed to they and a modifier relation from killed to the prepositional phrase with one stone. Similarly, in the case of Japanese word segmentation variants, the syntactic representations will be identical down to the largest word units, and will differ only in that the more aggressive segmentation variants necessitate subtrees that are absent in other variants. In both cases, we may therefore ask whether we can design a richer syntactic representation from which all variants can be extracted. This is the idea of Flexible Universal Dependencies (FUD).

3 Nested dependency graphs

The extended representation proposed here is based on two ideas. The first is to relax the tree constraint and allow general dependency graphs,3 from which dependency trees can be extracted using spanning tree algorithms familiar from the dependency parsing literature (McDonald et al., 2005). The second idea is to allow these dependency graphs to be nested in the sense that a (smaller) dependency graph can be a node in a (larger) dependency graph. Here is one way of formalizing these ideas, disregarding dependency labels for the moment:

• Let S = w_1, ..., w_n be a sentence segmented into a sequence of minimal syntactic units.
• We say that V_S = {w_1, ..., w_n} is the set of elementary nodes for S.
• A nested dependency graph for S is a directed graph G = (U, A), where every element of U is either
  1. an elementary node v ∈ V_S, or
  2. a nested dependency graph for a (proper) subsequence S′ of S with elementary node set V_{S′},
  such that U exactly covers S.

2 SUD differs from UD not only in the treatment of function words, but we will concentrate on this difference here.
3 The tree constraint in UD only holds for the basic syntactic representations; there is also an enhanced representation, which is graph-structured and encodes implicit syntactic relations, which does not concern us here.

Figure 4: Flexible UD representation with nested dependency graphs. [Nested dependency graph over “they may have killed two birds with one stone”, with UD-style and SUD-style arcs marked in blue and red, respectively.]

• We say that a node set U exactly covers S if every elementary node v ∈ V_S occurs exactly once in U or (recursively) in one of its graph-structured nodes.

The notion of a nested dependency graph is probably best explained through an example. Figure 4 shows a nested dependency graph for the sentence from Figure 1 and Figure 3. The top-level dependency graph has two nodes, the elementary node they and a graph node covering the rest of the sentence. That graph in turn has two elementary nodes (may, have) and one graph node, and so on. In order to extract an ordinary dependency tree from this representation, we need to extract a rooted directed spanning tree from each (nested) dependency graph. For the largest subgraph, we may pick the spanning tree marked in blue, corresponding to a UD analysis, or the spanning tree marked in red, corresponding to an SUD analysis. For the next smaller graph, there is only one spanning tree, which is common to both frameworks, and for the smallest graph we can again choose between a UD (blue) or SUD (red) analysis. If sentences are annotated with nested dependency graphs in this way, we can thus extract different styles of analysis by assigning different weights to the arcs and using a maximum spanning tree algorithm. To handle different word segmentation schemes, as in the Japanese example above, we instead have to choose between extracting a spanning tree and collapsing a subgraph covering a potential word unit into a single node.
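To make the extraction step concrete, the following Python sketch encodes one (sub)graph of candidate arcs, each tagged with the scheme(s) it belongs to, and extracts a scheme-specific tree. This is a toy stand-in for the weighted maximum-spanning-tree extraction proposed above, covering only the verb-group subgraph of the example sentence; the arc inventory and the `extract` helper are illustrative, not part of any existing UD tooling.

```python
# Toy sketch: candidate arcs for the verb group "may have killed", tagged with
# the annotation scheme(s) they belong to. Extraction keeps only the arcs of
# the requested scheme and checks that they form a tree over the nodes.

# (head, dependent, relation, schemes)
ARCS = [
    ("killed", "may",    "aux",      {"UD"}),
    ("killed", "have",   "aux",      {"UD"}),
    ("may",    "have",   "comp:aux", {"SUD"}),
    ("have",   "killed", "comp:aux", {"SUD"}),
]
NODES = {"may", "have", "killed"}

def extract(scheme):
    """Return (root, arcs) of the spanning tree for the given scheme."""
    arcs = [(h, d, r) for h, d, r, s in ARCS if scheme in s]
    dependents = [d for _, d, _ in arcs]
    # every node except the root has exactly one incoming arc
    assert len(dependents) == len(set(dependents)) == len(NODES) - 1
    (root,) = NODES - set(dependents)
    return root, sorted(arcs)

print(extract("UD"))   # root 'killed': function words depend on the main verb
print(extract("SUD"))  # root 'may': function words head the chain
```

In the full proposal, the choice between schemes would be made by arc weights and a maximum spanning tree algorithm rather than by explicit scheme tags, and the procedure would recurse through graph-structured nodes.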
4 Conclusion

In this paper, we have sketched a possible extension of the UD framework for syntactic annotation, which allows different variants of the annotation scheme to be extracted from the same compact representations. We have shown how this approach can be used to accommodate the different treatments of function words in UD and SUD, as well as different word segmentation schemes in languages like Japanese. We believe that it can also be used to resolve other issues in UD annotation, such as the analysis of (fixed) multiword expressions and coordination, but that remains outside the scope of this paper. Needless to say, this simple idea needs to be worked out in much more detail before it can be seriously considered for inclusion in UD, but we hope that it can at least inspire others to think about ways to add more flexibility to the framework.

Acknowledgment

Warm thanks to Lars Borin for fruitful and constructive collaborations over the years and for two key contributions to Universal Dependencies: pointing out that the Google universal part-of-speech tag set was missing one of the few truly universal word categories — interjections — and, as local co-chair, helping us find a meeting room for the UD kick-off meeting during EACL in Gothenburg 2014. Thanks, Lars, and happy anniversary!

References

Mark Aronoff. 2007. In the Beginning Was the Word. Language, 83:803–830.

Masayuki Asahara, Hiroshi Kanayama, Takaaki Tanaka, Yusuke Miyao, Sumire Uematsu, Shinsuke Mori, Yuji Matsumoto, Mai Omura, & Yugo Murawaki. 2018. Universal Dependencies Version 2 for Japanese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Joan Bresnan & Sam A. Mchombo. 1995. The lexical integrity principle: Evidence from Bantu. Natural Language and Linguistic Theory, 13:181–254.

Noam Chomsky. 1970. Remarks on Nominalization. In Roderick A. Jacobs & Peter S. Rosenbaum, editors, Readings in English Transformational Grammar, pages 11–61.
Ginn and Co.

Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, & Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47:255–308.

Kim Gerdes & Sylvain Kahane. 2016. Dependency Annotation Choices: Assessing Theoretical and Practical Issues of Universal Dependencies. In Proceedings of LAW X – The 10th Linguistic Annotation Workshop, pages 131–140.

Kim Gerdes, Bruno Guillaume, Sylvain Kahane, & Guy Perrier. 2018. SUD or Surface-Syntactic Universal Dependencies: An annotation scheme near-isomorphic to UD. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 66–74.

Ji Yoon Han, Tae Hwan Oh, Lee Jin, & Hansaem Kim. 2020. Annotation issues in Universal Dependencies for Korean and Japanese. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 99–108.

Martin Haspelmath. 2011. The Indeterminacy of Word Segmentation and the Nature of Morphology and Syntax. Folia Linguistica, 45:31–80.

Sylvain Kahane. 1997. Bubble Trees and Syntactic Representations. In Proceedings of the 5th Meeting of Mathematics of Language.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, & Jan Hajič. 2005. Non-Projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Yugo Murawaki. 2019. On the Definition of Japanese Word. CoRR, abs/1906.09719.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, & Dan Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), pages 1659–1666.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D.
Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, & Dan Zeman. 2020. Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), pages 4034–4043.

Joakim Nivre. 2003. Theory-Supporting Treebanks. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), pages 117–128.

Mai Omura, Aya Wakasa, & Masayuki Asahara. 2021. Word Delimitation Issues in UD Japanese. In Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), pages 142–150.

Timothy Osborne & Kim Gerdes. 2019. The Status of Function Words in Dependency Grammar: A Critique of Universal Dependencies (UD). Glossa, 4(1):17.

Nathan Schneider, Brendan O’Connor, Naomi Saphra, David Bamman, Manaal Faruqui, Noah A. Smith, Chris Dyer, & Jason Baldridge. 2013. A Framework for (Under)specifying Dependency Syntax without Overloading Annotators. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 51–60.

Takaaki Tanaka, Yusuke Miyao, Masayuki Asahara, Sumire Uematsu, Hiroshi Kanayama, Shinsuke Mori, & Yuji Matsumoto. 2016. Universal Dependencies for Japanese. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1651–1658.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Editions Klincksieck.

Perspective_on: Semantic relations for frames and constructions

Miriam R. L. Petruck, International Computer Science Institute, Berkeley, CA, USA, miriamp@icsi.berkeley.edu
Alexander Ziem, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, North Rhine-Westphalia, Germany, ziem@phil.uni-duesseldorf.de

Abstract

This paper considers the frame-to-frame relation Perspective_on in FrameNet, addressing its importance for both Frame Semantics and Construction Grammar.
Perspective_on highlights the extent to which the continuum of lexicon and grammar is crucial for both theories. Developers of frame-based lexical resources (Borin et al., 2010; Dannélls et al., 2021) and Constructicons, i.e., repositories of grammatical constructions built on the principles of Construction Grammar (Fillmore, 1988), have only begun to address this issue.

1 Introduction

FrameNet (Ruppenhofer et al., 2016), a research and resource development project grounded in the theory of Frame Semantics (Fillmore, 1982; Fillmore, 1985), provides information about the mapping between form and meaning in English. The organizing theoretical construct of FrameNet (FN) is the semantic frame, i.e., a schematic representation of some scene, whose frame elements (FEs), or semantic roles, identify participants and other conceptual entities in the scene that a sentence or an utterance describes. Aside from the semantic information that a frame captures, FN links frames in its hierarchy with frame-to-frame relations, including (among others) Perspective_on. This paper considers the frame-to-frame relation Perspective_on in FrameNet, addressing its importance for both Frame Semantics and Construction Grammar. Perspective_on highlights the extent to which the continuum of lexicon and grammar is crucial for both theories. Developers of frame-based lexical resources (e.g., Borin et al. (2010), Dannélls et al. (2021)) and Constructicons, i.e., repositories of grammatical constructions built on the principles of Construction Grammar (Fillmore, 1988), have only begun to address this issue.1

2 Background to FrameNet and the FrameNet Constructicon

2.1 FrameNet

FrameNet is a unique knowledge base that maps meaning to form through the theory of Frame Semantics (Fillmore, 1982; Fillmore, 1985).
The FrameNet database includes frame descriptions for over 1,200 semantic frames, also understood as script-like structures that provide background knowledge for the use and understanding of words in (a) language and facilitate inferencing about participants and events; more than 13,000 lexical units (LUs), each of which is a lexical construction; and nearly 200K manually annotated sentences in Frame Semantic terms. Moreover, FrameNet captures additional semantic information about relations between frames with a set of frame-to-frame relations. Aside from the significance of the intellectual accomplishment, FrameNet data serve as training data for downstream natural language processing applications, such as question-answering, event tracking, and information extraction, to name but a few.

1 A very early version of this paper was presented at the International Construction Grammar Conference (ICCG-8) in Osnabrueck, 2014. Space limitations preclude the inclusion of the entire presentation, although the authors intend to expand on the current work in the future.

Miriam R.L. Petruck and Alexander Ziem. 2022. Perspective_on: Semantic relations for frames and constructions. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 101–105. Available under CC BY 4.0

Relation         Super_frame    Sub_frame
Inheritance      Parent         Child
Subframes        Complex        Component
Precedes         Earlier        Later
Using            Parent         Child
Perspective_on   Neutral        Perspectivized
See_also         Main Entry     Referring Entry
Metaphor         Source         Target
Inchoative_of    Inchoative     State
Causative_of     Causative      Inchoative/State

Table 1: Frame-to-Frame relations in FrameNet

2.2 The FrameNet Constructicon

A constructicon is a structured repository of grammatical constructions, interrelated as a (mostly) taxonomic network linking the most general of constructions (as in a family of constructions) to its most specific type (Diessel, 2019).
A complete constructicon (for a single language) would include the full range of construction types, from highly schematic non-lexical constructions to meaningful argument-structure constructions, as well as partly idiomatic constructions, complex words, and even morphemes. The FrameNet Constructicon holds approximately 75 constructions, with detailed information about the kinds of linguistic material that can occur in specifiable positions within each construction, as well as those positions within which said construction may occur (Fillmore et al., 2012). Thus, developers of the FrameNet Constructicon first identified and defined constructions, along with their construction elements, then analyzed and labeled constructs that illustrate each construction. For example, consider the be_recip construction, examples of which appear below, where example 1 is the asymmetrical version of the construction and example 2 is the symmetrical version.

1. I know that [Chuck INDIVIDUAL_1] is friends [with Paul INDIVIDUAL_2].
2. I know that [Chuck and Paul INDIVIDUALS] are friends.

In this construction, the head noun, which is used as a predicate, must be a term that denotes a reciprocal relationship, e.g., partners, coworkers, etc.; note that the with-prepositional phrase in example 1 is not predictable. Simplifying matters for the current purposes, example 1 shows the construction elements INDIVIDUAL_1 and INDIVIDUAL_2, while example 2 shows the construction element INDIVIDUALS. In addition, the construction evokes the Reciprocality frame, which FN has characterized as states-of-affairs with Protagonists in relations with each other that may be viewed symmetrically. When these frame elements are equally prominent, each equally serving to identify the other, they manifest together as PROTAGONISTS.
When one of them defines the other (similar to a Ground), it is PROTAGONIST_2, with the other called PROTAGONIST_1 (i.e., the Figure).2

3 Frame-to-Frame relations in FrameNet

The so-called FrameNet hierarchy3 links frames through nine frame-to-frame semantic relationships, which Table 1 displays.4 Ruppenhofer et al. (2016) defines, explains, and illustrates all of FN’s frame-to-frame relations;5 here we focus exclusively on Perspective_on. This frame-to-frame relationship assumes the existence of a neutral parent frame and two child frames, each providing a different point of view on the neutral parent frame. For example, consider the Employment_start parent frame, two of whose children are Hiring and Get_a_job, where the former provides the EMPLOYER’s point of view and the latter provides that of the EMPLOYEE.6 Using FrameGrapher, FN’s visualization tool, Figure 1 depicts the Employment_start frame along with several related frames, including Hiring and Get_a_job, each related via Perspective_on to its parent (displayed with pink arrows).

2 See Lee-Goldman & Petruck (2018) for a detailed description of this construction.
3 Structurally, the FrameNet hierarchy is most similar to a lattice. See Valverde-Albacete (2005) for further information.
4 For expediency, FrameNet represents Inchoative_of and Causative_of as frame-to-frame relations, not lexical relations (which they are). See Petruck et al. (2004).
5 Limitations of space preclude presenting and illustrating all of the frame-to-frame relations listed in Table 1. See Ruppenhofer et al. (2016, p. 79-85) for discussion of all of FN’s semantic relations.

Figure 1: Employment_start unperspectivized frame.

4 Perspective_on as a relation between constructions

The understanding that Perspective_on relates frames in the FN hierarchy to each other when those frames capture two points of view (or perspectives) allows exploiting the relation to capture semantic relations between constructions, not only between frames.
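At the frame level, the relation can be pictured as plain data. The following Python sketch is a toy illustration only, not FrameNet's actual data model: the frame names come from the Employment_start example above, while the triple representation and the query helper are hypothetical.

```python
# Hypothetical sketch: frame-to-frame relations as (relation, super, sub)
# triples, following the Super_frame/Sub_frame columns of Table 1.
RELATIONS = [
    ("Perspective_on", "Employment_start", "Hiring"),
    ("Perspective_on", "Employment_start", "Get_a_job"),
]

def perspectives_on(neutral_frame):
    """All frames that perspectivize the given neutral frame."""
    return sorted(sub for rel, sup, sub in RELATIONS
                  if rel == "Perspective_on" and sup == neutral_frame)

print(perspectives_on("Employment_start"))  # ['Get_a_job', 'Hiring']
```

Extending the relation to the construction level, as discussed next, would mean linking pairs of constructions rather than pairs of frames in the same way.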
Consider the correlation between active and passive, as in examples 3 and 4, respectively, where the passive in example 4 shifts the point of view from the SELLER, Chuck, to the GOODS, the car.

3. I know that [Chuck SELLER] sold [the car GOODS] for $1000.
4. I know that [the car GOODS] was sold [by Chuck SELLER] for $1000.

Much the way Perspective_on profiles an event on the lexical level in the FrameNet lexicon, so too does the relation operate on the level of constructions, thus also hinting at the similarity between lexical and grammatical constructions. Perspective_on is also useful for relating constructions that involve multiple affected entities, as in the Double_object construction. Consider examples 5 and 6, where the two versions of that construction make use of the same frame elements, namely BUYER and GOODS, in different syntactic realizations. The syntactic realization of the BUYER in a PP-to phrase changes the semantic profiling in the sentence from the BUYER, Jerry, to the GOODS, the car.

5. Chuck sold [Jerry BUYER] [the car GOODS] for $1000.
6. Chuck sold [the car GOODS] [to Jerry BUYER] for $1000.

The data in examples 5 and 6 demonstrate the usefulness of the frame-to-frame relation Perspective_on for semantic profiling. More generally, the two sets of examples (3 and 4 along with 5 and 6) also suggest that the utility of Perspective_on as a relation between constructions is not limited to just one (type of) construction.

5 Related work

Not surprisingly, developers of FrameNet resources and Constructicons have addressed the issues of (1) the relationships between constructions and frames, as well as (2) the relationships between constructions. This section briefly discusses some of the most immediately relevant works, presenting them in chronological order of publication.

6This example derives from Ruppenhofer et al. (2016, p. 9).

In an effort to control the connections between frames and constructions in the FrameNet Brasil database, Torrent et al.
(2014) outlined policies for the annotation of constructions, specifically for the identification and labeling of construction elements, in the Brazilian Portuguese Constructicon. The motivation behind these policies includes remaining faithful to the principles of Frame Semantics and Construction Grammar. Additionally, adopting and implementing the policies recognizes the continuum of lexicon and grammar (Fillmore, 2008). Although the work focused on relations between construction elements and frame elements, it also drew attention to families of constructions, which necessarily requires the analyst to consider the relations between (or among) members of one such family. Somewhat similarly, Ohara (2018) also addressed relations between frames and constructions in the Japanese FrameNet Constructicon, albeit from a different perspective from that of Torrent et al. (2014), where the latter concerned annotation practices for using a combined tool that facilitates work on both frames and constructions. Ohara (2018) is concerned with distinguishing between (1) frame-based description and annotation of the semantico-syntactic structures of lexical units (and multiword expressions that FrameNet has defined as such) and (2) constructicon annotation for describing the internal and external syntax and semantics of linguistic objects having complex structures. Importantly, this work introduces a frame-based classification of constructions, which developers of constructicons might employ to facilitate determining relations between (or among) constructions. In developing the German Constructicon (https://gsw.phil.hhu.de/constructicon/), Boas & Ziem (2018) took a contrastive approach and considered German constructions in relation to their English analogs. One goal of the work was to leverage existing construction entries in the FrameNet Constructicon for English (Fillmore et al., 2012) to develop the German Constructicon.
That goal required defining and exploiting the notion of a continuum of constructional correspondence. The work is more about relations between constructions across two genetically related, yet typologically distinct languages (at least in terms of the morphology and the syntax of each language, as Kastovsky (2011) suggests), than it is about semantic relations between constructions within a single language, namely German. Nonetheless, the developers of the German Constructicon clearly know about relations between constructions in a constructicon. As such, even if only by implication, in directing the reader's attention to correspondences, Boas & Ziem (2018) also ask the reader to attend to semantic relations between constructions in the German Constructicon effort (Ziem et al., 2019). While the works briefly described above do not address relations between constructions directly within a single constructicon, they strongly suggest that the extended community of frame-based and construction-based resource developers is aware of the necessity of addressing relations between constructions. Perhaps the next round of development in building constructicons will include attention to that necessity.

6 Concluding remarks

The structured event formalism for representing FrameNet's informal descriptions in Chang et al. (2002) also offered a way of handling linguistic focus, which FrameNet has implemented with the semantic relation Perspective_on. Building on that insight, this paper addressed the application of one frame-to-frame relation for describing semantic relations between constructions, and in constructicons in general.
Aside from the relatively recent work that addresses relations between frames and constructions in the context of constructicon development, this paper also suggests that Construction Grammarians more generally (not just those involved in constructicon development) might investigate existing frame-to-frame relations for relating different types of constructions, beyond that of Inheritance, which Fillmore (1999) already identified as playing an important role in Construction Grammar.

References

Hans Christian Boas & Alexander Ziem. 2018. Constructing a constructicon for German. In Benjamin Lyngfelt, Lars Borin, Kyoko Ohara, & Tiago Timponi Torrent, editors, Constructicography, pages 183–228. John Benjamins, Amsterdam and Philadelphia.

Lars Borin, Dana Dannélls, Markus Forsberg, Maria Toporowska Gronostaj, & Dimitrios Kokkinakis. 2010. The past meets the present in Swedish FrameNet++. In 14th EURALEX International Congress, pages 269–281.

Nancy Chang, Srini Narayanan, & Miriam R. L. Petruck. 2002. Putting frames in perspective. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING, pages 1–7, Stroudsburg, PA. ACL.

Dana Dannélls, Lars Borin, & Karin Friberg Heppin. 2021. The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications. Number 14 in Natural Language Processing. John Benjamins Publishing Company, Amsterdam and Philadelphia.

Holger Diessel. 2019. The Grammar Network: How Linguistic Structure Is Shaped by Language Use. Cambridge University Press, Cambridge.

Charles J. Fillmore, Russell Lee-Goldman, & Russell G. Rhodes. 2012. The FrameNet constructicon. In Hans C. Boas & Ivan A. Sag, editors, Sign-based Construction Grammar, pages 309–372. CSLI.

Charles J. Fillmore. 1982. Frame semantics. In Linguistics in the Morning Calm, pages 111–138. Linguistic Society of Korea, Seoul.

Charles J. Fillmore. 1985. Frames and the semantics of understanding.
Quaderni di Semantica, 6(2):222–254.

Charles J. Fillmore. 1988. The Mechanisms of Construction Grammar. In Proceedings of the 14th Annual Meeting of the Berkeley Linguistics Society, pages 35–55.

Charles J. Fillmore. 1999. Inversion and constructional inheritance. Lexical and constructional aspects of linguistic explanation, 1:113–128.

Charles J. Fillmore. 2008. Border conflicts: FrameNet meets Construction Grammar. In Elisenda Bernal & Janet DeCesaris, editors, Proceedings of the XIII EURALEX International Congress.

Dieter Kastovsky. 2011. Typological differences between English and German morphology and their causes. In Toril Swan, Endre Mørck, & Olaf Jansen Westvik, editors, Language Change and Language Structure: Older Germanic Languages in a Comparative Perspective, pages 135–158. De Gruyter Mouton.

Russell Lee-Goldman & Miriam R. L. Petruck. 2018. The FrameNet constructicon in action. In Benjamin Lyngfelt, Lars Borin, Kyoko Ohara, & Tiago Timponi Torrent, editors, Constructicography, pages 19–40. John Benjamins, Amsterdam and Philadelphia.

Kyoko Ohara. 2018. Relations between frames and constructions: A Proposal from the Japanese FrameNet Constructicon. In Benjamin Lyngfelt, Lars Borin, Kyoko Ohara, & Tiago Timponi Torrent, editors, Constructicography: Constructicon Development across Languages, pages 141–164. John Benjamins, Amsterdam and Philadelphia.

Miriam R. L. Petruck, Charles J. Fillmore, Collin F. Baker, Michael Ellsworth, & Josef Ruppenhofer. 2004. Reframing FrameNet data. In G. Williams & S. Vessier, editors, Proceedings of The 11th EURALEX International Congress, pages 405–416, Lorient.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Collin F. Baker, & Jan Scheffczyk. 2016. FrameNet II: Extended Theory and Practice. ICSI, Berkeley.

Tiago Timponi Torrent, Ludmila Lage, Thais Fernandes Sampaio, Tatiane Tavares, & E. Matos. 2014.
Revisiting border conflicts between FrameNet and construction grammar: Annotation policies for the Brazilian Portuguese constructicon. Constructions and Frames.

Francisco J. Valverde-Albacete. 2005. Explaining the structure of FrameNet with concept lattices. In Bernhard Ganter & Robert Godin, editors, Formal Concept Analysis, pages 79–94, Berlin, Heidelberg. Springer Berlin Heidelberg.

Alexander Ziem, Johanna Flick, & Phillip Sandkühler. 2019. The German Constructicon Project: Framework, methodology, resources. Lexicographica, 35:15–40.

Natural language processing for educational applications: Recent advances

Ildikó Pilán
Norwegian Computing Center
Oslo, Norway
pilan@nr.no

Abstract

We summarize recent advances in Natural Language Processing applied to topics related to the educational domain, based on the latest publications from a major conference in the field. The three topics we touch upon are feedback, difficulty assessment and question generation.

1 Introduction

The 60th Annual Meeting of the Association for Computational Linguistics (ACL) and the co-located workshops, held in May 2022, included a number of articles applying Natural Language Processing (NLP) to the educational domain. Most of these studies center around three topics: feedback, difficulty assessment and question generation. In the following three sections, we provide an overview of the relevant articles, all of which target English language data, except for two studies involving German resources. We do not include a separate, more in-depth discussion of recent research published at workshops dedicated to educational NLP topics, since interested readers can find relevant papers more easily in those proceedings than in the broad body of research published at more general NLP conferences.
2 Feedback

In recent years, there has been an increasing focus on developing automated assessment systems providing explainable and understandable feedback that goes beyond a mere correct-incorrect response, a need that has, in fact, also been emphasized in previous work (Deeva et al., 2021). Filighera et al. (2022) took a step in this direction by creating a dataset of content-focused elaborated feedback for an automatic short answer grading system. The dataset consists of (i) learner responses; (ii) the reference answers; (iii) a score and (iv) detailed feedback explaining that score. The dataset contains more than 2000 responses in German and English, to 8 and 22 different questions respectively, on the topic of a college-level communication networks course. The authors also included baselines created by fine-tuning a T5 Transformer language model (Raffel et al., 2020) on their dataset. They found that these baselines improve on a majority baseline, but still perform considerably more poorly than humans. However, they also pointed out that measures such as BLEU and ROUGE fail to capture content similarity between manual and automatic feedback. Kaneko et al. (2022) proposed an example-based Grammatical Error Correction (GEC) system, which uses retrieved example sentences both for generating more accurate corrections compared to traditional GEC systems, and for providing an explanation to language learners about their errors. For each error, a pair of correct and incorrect sentences similar to the original, learner-written sentence is shown by the system. These serve as a kind of indirect explanation to learners about why their sentence might be incorrect. The authors also conducted a user study, in which they found that providing language learners with examples helped them decide whether to accept or reject the automatically suggested corrections.

Ildikó Pilán. 2022. Natural language processing for educational applications: Recent advances.
In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 107–109. Available under CC BY 4.0

3 Difficulty

Research on readability and difficulty in general, especially targeting English, seems to have somewhat decreased in recent years, but we can still find a few examples in this direction. Steinmetz & Harbusch (2022) presented EasyTalk, an interactive support system for German-speaking low-literate users with intellectual or developmental disabilities. The system helps users write coherent and correct text suitable for their proficiency level. Words can be supplemented with symbols, and users are reminded to complement their texts with information covering wh-questions as well as with conjunctions improving coherence. The authors also conducted a small case study with low-literate users, in which they investigated writing behaviour with eye-tracking recordings and found that participants used most parts of the system for redacting, but only to a minor extent for adding connectors. It is worth noting that there is a growing interest in writing assistants in the broader educational technology community, including NLP. In fact, at ACL 2022, a whole workshop was dedicated to the topic of writing assistants1, where the above-mentioned study was also published. The workshop aimed to be a meeting place for researchers in NLP and human-computer interaction, as well as for writers and industry practitioners. Dialogue among representatives with such a broad set of expertise has good potential to spark research in the area that is relevant to user needs. Another set of experiments, related to the difficulty of materials presented to learners rather than written by them, is described in Byrd & Srivastava (2022), who investigated predicting the difficulty of natural language questions based on a question answering (QA) dataset (HotPotQA) and Item Response Theory (IRT).
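To make the IRT notion of difficulty concrete, a minimal one-parameter (Rasch) model can be sketched as follows; this toy code is our own illustration, not the setup of Byrd & Srivastava (2022):

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) model: probability that a respondent with ability
    `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# IRT pins down "difficulty" as the ability level at which the chance
# of a correct answer is exactly 50%.
assert abs(p_correct(theta=1.2, b=1.2) - 0.5) < 1e-9

# A harder item (larger b) lowers the success probability for a fixed ability.
print(round(p_correct(theta=0.0, b=2.0), 3))  # ≈ 0.119
```

Estimating item difficulty b from observed responses is what requires a population of respondents, which is where the (simulated) crowd described below comes in.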
The advantage of this psychometric tool is that it defines difficulty in a straightforward manner, namely as the point at which there is a 50% chance of answering a question correctly. To simulate a variety of responses for their study, the authors created an artificial crowd by training a QA model with varying amounts of data and epochs. Questions covered a variety of topics (e.g. entertainment, biology) and types (yes/no, wh-questions). The experiments showed that yes/no questions had a more consistent difficulty level. Textual features were also correlated with difficulty using different non-neural models, which showed that commas and complex words were among the most important features for determining question difficulty. Such automatic question difficulty assessment enables the development of question generation systems where a desired difficulty level can be specified.

4 Question generation

The third educational NLP theme at ACL 2022 centered around the automatic generation of questions. Ghanem et al. (2022) present a novel question generation dataset and a system generating inferential questions targeting specific comprehension skill types. Compared to extractive questions, which have been the focus of previous work (Murakhovs'ka et al., 2022), inferential questions better measure learner understanding. The authors created and released a dataset annotated with story-based reading comprehension skills that makes it possible to train systems able to generate questions with explicit control for such skills. The dataset contains 726 children's stories and is annotated for the following skill types: Basic Story Elements, Character Traits, Close Reading, Figurative Language, Inferring, Predicting, Summarizing, Visualizing and Vocabulary. Each type has an average of ca. 5 question-answer pairs per story.
Another interesting study explores the use of summaries for alleviating a major problem in reading comprehension question generation, namely irrelevant or uninterpretable questions (Dugan et al., 2022). Instead of the original text, the authors experimented with providing a T5-based question generation model with human-written summaries. The passages used were chapters from a well-known NLP handbook. Their results indicated that the 3 annotators participating in the evaluation accepted a considerably higher proportion of questions generated by this method than of questions generated from the original textbook passages alone. Questions were also more relevant and interpretable without a larger context. In the absence of human-written summaries, using automatic summaries still led to improved question generation in the experiments presented.

1https://in2writing.glitch.me/

5 Conclusion

As the studies presented above show, recent research in educational NLP has been branching out to areas and tasks that previously remained less explored, such as question generation and feedback. However, these studies mostly focus on English as a target language, which is most likely due in part to the availability of more resources, both as unannotated data for pre-training the most recent transformer models and as datasets annotated for specific tasks. Some studies also included human evaluations which, understandably, remained somewhat limited in size. They represent, nonetheless, an important step towards better understanding the performance and usability of the proposed systems, a much-needed aspect in the educational domain.

References

Matthew Byrd & Shashank Srivastava. 2022. Predicting Difficulty and Discrimination of Natural Language Questions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 119–130, Dublin. Association for Computational Linguistics.
Galina Deeva, Daria Bogdanova, Estefanía Serral, Monique Snoeck, & Jochen De Weerdt. 2021. A review of automated feedback systems for learners: Classification framework, challenges and opportunities. Computers and Education, 162:104094.

Liam Dugan, Eleni Miltsakaki, Shriyash Upadhyay, Etan Ginsberg, Hannah Gonzalez, DaHyeon Choi, Chuning Yuan, & Chris Callison-Burch. 2022. A Feasibility Study of Answer-Agnostic Question Generation for Education. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1919–1926, Dublin. Association for Computational Linguistics.

Anna Filighera, Siddharth Parihar, Tim Steuer, Tobias Meuser, & Sebastian Ochs. 2022. Your Answer is Incorrect... Would you like to know why? Introducing a Bilingual Short Answer Feedback Dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8577–8591, Dublin. Association for Computational Linguistics.

Bilal Ghanem, Lauren Lutz Coleman, Julia Rivard Dexter, Spencer von der Ohe, & Alona Fyshe. 2022. Question Generation for Reading Comprehension Assessment by Modeling How and What to Ask. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2131–2146, Dublin. Association for Computational Linguistics.

Masahiro Kaneko, Sho Takase, Ayana Niwa, & Naoaki Okazaki. 2022. Interpretability for Language Learners Using Example-Based Grammatical Error Correction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7176–7187, Dublin. Association for Computational Linguistics.

Lidiya Murakhovs'ka, Chien-Sheng Wu, Philippe Laban, Tong Niu, Wenhao Liu, & Caiming Xiong. 2022. MixQG: Neural Question Generation with Mixed Answer Types. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1486–1497, Seattle. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, & Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67.

Ina Steinmetz & Karin Harbusch. 2022. A text-writing system for Easy-to-Read German evaluated with low-literate users with cognitive impairment. In Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022), pages 27–38, Dublin. Association for Computational Linguistics.

Detecting fake papers with the latent algorithm for recursive search

Stian Rødven-Eide, University of Gothenburg, stian.rodven.eide@gu.se
Ricardo Muñoz Sánchez, University of Gothenburg, ricardo.munoz.sanchez@gu.se

Abstract

The current avalanche of fake scientific papers appearing on respected websites such as snarXiv.org has become of much concern for researchers. After concluding that existing methods for distinguishing between real and fake scientific papers leave much to be desired, we have developed the Latent Algorithm for Recursive Search for this purpose. Our proposed method is not only able to identify fake scientific papers with extremely high accuracy, but can also automatically retract the identified papers.

1 Introduction

• Intrinsic Structure Reduction for Reversing Sentiment in Investigative Journalism,
• SELMA: A Novel Approach to Assessing the Novelty of Novels, and
• Understanding Universal Undercurrents in Unappreciated Unions – a Semantic Understudy.

These are just some of the titles behind which blatantly fake papers hide, purporting to be scientific, but in essence offering nothing of essence. Some might be automatically generated, some created by internet trolls for a laugh and a half, and some carefully constructed to cause chaos in the community.
Inspired by computational models of hide-and-seek, such as Latent Dirichlet Allocation, Latent Semantic Analysis and Latent Discriminant Analysis, we have developed a novel state-of-the-art algorithm to address this growing problem. Our Latent Algorithm for Recursive Search (LARS) has proven itself capable of detecting even the fakest papers. In the following sections, we describe how and why this is possible, detailing the inner workings of LARS, and presenting the results in comparison to other recent advances.

2 Related work

With the widespread appearance of false information following the onset of the COVID-19 pandemic, much research has been devoted to identifying false information on several media outlets. One of the biggest issues we face is that humans are not good at detecting misinformation, with even experts falling for both fake news1 and academic papers.2 One of the tasks that has grown in recent years is that of fake news detection. Oshikawa et al. (2020) note in their survey that most datasets tend to be relatively small due to the need to obtain fact-checked articles. This leads to most papers dealing with a simple binary classification task. While the field of detecting fake academic articles has not received much attention yet, some attention has been brought to it since the groundbreaking work of Baldassarre (2020). This paper notes how important it is to have a good peer-reviewing system, to check for the veracity of the data, and to go through the cited literature.

1https://www.thedailybeast.com/fooled-by-the-onion-9-most-embarrassing-fails
2https://www.sciencealert.com/cultural-studies-sokal-squared-hoax-20-fake-papers

Stian Rødven-Eide and Ricardo Muñoz Sánchez. 2022. Detecting fake papers with the latent algorithm for recursive search. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 111–113. Available under CC BY 4.0
It further notes how these processes can break down, leading to non-serious papers being accepted into otherwise academic outlets. Two of the more interesting approaches of late are the FFF-method, as proposed by Borrs et al. (2022), and the Agreeable Knowledge algorithm, developed by Larn et al. (2022). What both of these have in common, and what we have found it sound to rely on, is a simultaneously holistic and recursive understanding of the nature of scientific publishing: the assumption that any given scientific paper attempts to replicate itself through self-absorption, as well as to insert itself into as many other papers as possible. We call this the Vital Viral Vector (VVV).

3 Methodology

In order to exploit the VVV for LARS, its direction in the scientiverse must first be established. This is achieved by finding the non-trivial zeros of the following function, where x is a representation of our document:

$$\zeta(x) = \sum_{n=1}^{\infty} \frac{1}{n^x}$$

Once we have the vector of all non-trivial zeros of the previous equation (the VVV embedding), we insert it into a recursive matrix, where the endpoints consist of the matrix itself. We then backpropagate by finding the eigenvalues of the space of non-real papers. Essentially, if D is the domain where all interesting and novel papers can be found and B is the boundary where papers become false, we define its Euler-Lagrange equations as follows:

$$\phi(VVV) = \lim_{b \to 0} \frac{1}{\sqrt{|b|}}\, e^{-VVV/b}$$

$$Q[\phi] = \int_{D} p(X)\, \nabla\phi \cdot \nabla\phi\; q(X)\, \phi\, dX + \int_{B} \sigma(S)\, \phi^2\, dS$$

Lastly, we apply a latency filter that separates true and false upon dimensionality reduction. This can be done by integrating the hyperbolic Riemannian representation of the filter

$$[\wp'(Q)]^2 = 4[\wp(Q)]^3 - g_2\, \wp(Q) - g_3$$

This is most easily done by finding $g_1$, $g_2$, and $g_3$ such that $g_1^3 + g_2^3 = g_3^3$ and applying a logistic regression to $\wp(Q)$. We now have the final score, signifying the trueness and/or truthiness of the scrutinised paper.
The only step left is retraction, which is done through undetected infiltration of the infected papers – taking them down from the inside, so to speak.

4 Evaluation

For four different datasets, we ran LARS as well as the two most prominent alternative methods, FFF (Borrs et al., 2022) and FAKE (Larn et al., 2022), the results of which are available in Table 2. The datasets we used are part of the shared task Fake Methods for Finding Fake Papers, which took place at BCL in 2021 (Borscht & Goulash, 2021). These are detailed in Table 1. All our results are, as you can see, much better than those provided by other methods.3 If you look closely at the F1-scores, you will see that the result for snarXiv is 1.01. The reason for this is that this paper, the one you are reading, was implicitly (and latently) included into the test set upon running the algorithm, and successfully identified as well.

3At least the ones we tested.

Dataset           Genre                Documents   Tokens
Onion-stories     News                 1,742,009   1,742,010
Old York Times    Olds                 34          34,000,000
Borin-collection  Scientific papers    9,999       9,999,999
snarXiv.org       Unscientific papers  4,321       1,234,567

Table 1: Datasets from the shared task.

Method  Onion  OYT   Borin  snarXiv
FFF     0.73   0.77  0.80   0.71
FAKE    0.79   0.78  0.79   0.77
LARS    0.99   0.99  1.00   1.01

Table 2: F1-scores for fake paper detection.

4.1 Recursive testing

An unfortunate side-effect of LARS is that it automatically evaluates any paper in which it is mentioned. The result of that evaluation is then inserted into that paper as a subsection named Recursive Testing. The likelihood that this paper is as fake as the articles it set out to scrutinise is 97.8%.

5 Conclusion and future work

As we can see, this paper is as fake as they come. However, considering the humorous nature of its nature, we, naturally, regard this as entirely natural.

References

Daniel T. Baldassarre. 2020. What's the deal with birds? Scientific Journal of Research and Reviews, 2(4).

Lina Borrs, Boris Larn, & Biron Rals. 2022.
Finding false flags – and other anomalous analogies. Journal of False Disinformation, 14(3):123–234.

Johannes Borscht & Maria Goulash. 2021. Shared task: Fake methods for finding fake papers. In Proceedings of the 132nd Workshop on Fake Publishing, pages 52–57, Georgetown, Guyana. Brotherhood for Computational Linguistics.

Boris Larn, Lina Borrs, & Biron Rals. 2022. Finding agreeable knowledge evenly. Journal of Diffusion Through Confusion, 15(4):234–345.

Ray Oshikawa, Jing Qian, & William Yang Wang. 2020. A survey on natural language processing for fake news detection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6086–6093. European Language Resources Association.

Leksikalsk-semantiske sprogressourcer: Hvad kan de, og hvordan udvikler vi dem bedst?
[Lexical-semantic language resources: What can they do, and how do we best develop them?]

Bolette Sandford Pedersen, Center for Sprogteknologi, Københavns Universitet, Danmark, bspedersen@hum.ku.dk
Sanni Nimb, Det Danske Sprog- og Litteraturselskab, Danmark, sn@dsl.dk
Sussi Olsen, Center for Sprogteknologi, Københavns Universitet, Danmark, saolsen@hum.ku.dk

Abstract

The paper discusses the status of lexical-semantic language resources in the era of statistical language models. We argue that a lot of essential background information about culture, society and the surrounding world is available via these resources; information which cannot be deduced from text alone, but which is crucial for language interpretation. We describe the Danish scenario and the series of resources that have been compiled for Danish in a joint venture between lexicographers and NLP researchers. Over the course of two decades, three such resources have been developed (a wordnet, a framenet and a sentiment lexicon), all with identifier links to the same sense inventory, namely that of Den Danske Ordbog.
We also present a new resource, the COR lexicon, which draws on these existing lexical resources but attempts to create an easy-to-use, joint semantic lexicon for AI developers, with a more coarse-grained sense inventory and a core set of semantic information types. Finally, we argue that lexical semantic resources for NLP should ideally be integrated as part of a larger lexicographical infrastructure with the aim of easing future scaling and maintenance.

1 Lexical language resources and language models

Lexical language resources that describe the meaning and role of words from different perspectives have been central building blocks in many language technology applications over the past decades. However, they have also constituted the bottleneck in many systems, because it has been difficult to achieve sufficient coverage and sufficient consistency for them to interact smoothly with, for example, formal grammars or, later, statistical language models. We have constantly had to fall back on techniques that handle so-called out-of-vocabulary (OOV) problems as best they can, that is, cases where a given word is not described in the resource. This is partly because the vocabulary of a language is constantly evolving, and because many words, especially compounds, are often created dynamically on the spot in specific contexts. But it is also, to at least the same extent, because language resources for NLP have often been developed somewhat neglectfully and outside the lexicographic communities, i.e., without the genuine lexicographic expertise and the setup necessary to develop and maintain a solid and consistent lexical resource. We argue that lexical-semantic resources should continue to play a central role in NLP, even at a time when neural language models have taken us far with pure text statistics. Lexical resources can

Bolette Sandford Pedersen, Sanni Nimb and Sussi Olsen. 2022.
Leksikalsk-semantiske spro- gressourcer: Hvad kan de, og hvordan udvikler vi dem bedst?. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 115–120. Available under CC BY 4.0 115 noget vigtigt, og de rummer noget viden som vi ikke kan eller bør undvære i vores sprogteknologiske tjenester, som helst skal være inkluderende og tillidsskabende for alle. Der er næppe nogen tvivl om at sprogmodeller også fremover vil være biased i forhold til de tekster de baseres på, og at tekster i øvrigt ikke rummer den viden om sproget og verden som er nødvendig hvis NLP skal kunne levere en dybere og mere praktisk anvendelig sprogforståelse (Bender et al., 2021). Et velkendt og illustrativt eksempel er black sheep-problematikken (cf. Van Durme (2009)) som be- skriver det problem der opstår hvis vi spørger en statistisk sprogmodel hvilken farve et får har. Her får vi for de fleste sprog svaret sort. Dette selv om vi alle ved at det sorte får er undtagelsen der bekræfter reglen om at får som de er flest, er hvide eller grå. Det er præcis baggrundsviden af denne type som de leksikalsk-semantiske ressourcer beskriver, og derfor er de vigtige at inddrage.1 Selv om det sorte får er en metafor, typisk for mennesker der ikke følger den slagne vej, er eksemplet i al sin enkelthed godt fordi det illustrerer hvad vi skriver om, set relativt til den verden som agerer baggrund for det vi skriver om. Den mest selvfølgelige baggrundsvi- den er med andre ofte ikke eksplicit i teksten, og derfor har de statistiske sprogmodeller svært ved at indfange den. Dette skaber problemer i mange sammenhænge hvor statistiske sprogmodeller anvendes til sprogforståelse, parallelt med at der også ses et overraskende tungt bias mod fx demografiske stereoty- per og kønsstereotyper i state-of-the-art sprogmodeller. Sidstnævnte problemstilling er i høj grad blevet adresseret i flere nyere videnskabelige artikler (se fx Kurita et al. 
(2019) og Sólmundsdóttir et al. (2022) for kønsbias i maskinoversættelse og NLP mere generelt) og i pressen fordi problemet med sådanne bias er så åbenlyst. Den anden problematik med manglende baggrundsviden fremstår derimod endnu ikke så tydeligt, men er måske nok så alvorlig. Ordbøgerne med deres semantiske beskrivelser dækker selvfølgelig ikke denne baggrundsviden alene, men de udgør en vigtig brik i det samlede billede.2

Udover at argumentere for at de neurale sprogmodeller bør beriges med mere basal baggrundsviden, argumenterer vi også for hvorfor leksikalsk-semantiske ressourcer ikke bør udvikles isoleret fra et sprogs øvrige leksikografiske virke men derimod bør ses som en naturlig del af eller "spin-off" på den leksikografiske virksomhed der i øvrigt foregår i et givent sprogmiljø. Det store samarbejdsprojekt ELEXIS (Krek et al. (2018); https://elex.is/), som vi afsluttede i 2022, har kun bekræftet denne tilgang. Projektet er lykkedes med at skabe en kobling mellem de leksikografiske miljøer og udvalgte NLP-miljøer i Europa og har arbejdet henimod at åbne og standardisere de "traditionelle" ordbøger sådan at de informationer de rummer, i højere grad kan komme i spil i NLP. Selv om de intellektuelle rettigheder i de leksikografiske miljøer stadig vanskeliggør fuld udnyttelse af ordbøger til NLP, er dette et vigtigt skridt på vejen.

Nedenfor opsummerer vi vores eget arbejde med leksikalsk-semantiske sprogressourcer som det har udviklet sig henover de seneste to årtier, se også Pedersen et al. (2021). Det drejer sig i essensen om det danske WordNet, DanNet, det Danske FrameNet og det Danske Sentimentleksikon. Alle er udviklet med fast base i Den Danske Ordbogs betydningsinventar og beskæftiger sig med hhv. det paradigmatiske, det syntagmatiske og det konnotative perspektiv af ordenes betydning.
Endelig beskriver vi hvordan vi i et nyligt igangsat ordbogsprojekt for kunstig intelligens samler de væsentligste oplysningstyper fra de tre ressourcer i et samlet Centralt OrdRegister for Dansk (COR).3

Det er vigtigt for os i denne sammenhæng at nævne at kollegers arbejde ved Språkbanken og på de tilsvarende svenske semantiske ressourcer som SALDO-ordbogen, det Svenske FrameNet og den svenske sentimentordbog SenSALDO gennem årene har været en stor inspiration for vores arbejde. Det har været interessant at se hvordan man har grebet arbejdet an ved Språkbanken, og hvordan man har fordelt indsatsen med at styrke det tilsvarende svenske sprogområde på NLP-området.

1 Den Danske Ordbog og DanNet har fx følgende definition på får: mellemstor drøvtygger med meget kraftig, oftest hvidlig uldpels.
2 Encyklopædisk viden udgør en anden dimension af væsentlig baggrundsviden om end grænsedragningen mellem det semantiske og det encyklopædiske ikke altid er lige klar.
3 Dette igangværende projekt finansieres af Digitaliseringsstyrelsen som en del af en satsning på kunstig intelligens i Danmark.

2 Et sæt af danske leksikalsk-semantiske ressourcer med fast forankring i Den Danske Ordbog

2.1 DanNet

Samarbejdet mellem et leksikografisk og et sprogteknologisk miljø i Danmark blev grundlagt i nullerne med igangsættelse af DanNet-projektet (Pedersen et al., 2009). I dette projekt arbejdede Center for Sprogteknologi (CST) ved Københavns Universitet og Det Danske Sprog- og Litteraturselskab (DSL) sammen om at udvikle et omfattende dansk wordnet baseret på Den Danske Ordbogs (DDO) definitioner. Genus proximum i DDO, der i forvejen var identificeret i ordbogs-manuskriptet, blev aksen hvorom et ordnetværk kunne genereres semiautomatisk og derefter justeres af sprogteknologer og leksikografer. Ressourcen rummer i dag knap 70.000 begreber organiseret i synsets, som er forsynet med bl.a. ontologiske typer og overbegreber.
Samlet set rummer ressourcen mere end 300.000 indbyrdes semantiske relationer, og den udvides løbende med flere betydninger.

2.2 Begrebsordbog

Erfaringerne med at udarbejde et wordnet på basis af DDO var afgørende for tilblivelsen af den senere danske tesaurus, kaldet Den Danske Begrebsordbog (Begrebsordbogen), der blev skrevet i årene 2010–2015 på DSL. I Begrebsordbogen tildeles betydningerne i DanNet yderligere en emnebetegnelse, og begreber listes i semantisk rækkefølge, opdelt i grupper af synonymer og nærsynonymer der indledes af et nøgleord, ofte et overbegreb. Samtidig blev flere DDO-lemmaer og -betydninger tilføjet så op mod 95 % af DDO er repræsenteret. Med Begrebsordbogens færdiggørelse åbnede der sig en række nye muligheder for at sammenkoble onomasiologiske og semasiologiske leksikalske oplysninger fra de tre ressourcer: et WordNet, en ordbog og en tesaurus.

2.3 Det Danske FrameNet-leksikon

De tematiske oplysninger og semantiske undergrupper i Begrebsordbogen blev kombineret med DDOs valensmønstre. Disse data dannede grundlag for udarbejdelsen af et dansk framenet-leksikon med fokus på verbalbetydninger. Da Begrebsordbogen både beskriver kollokationer fra den korpusbaserede DDO og ofte tildeler mere end ét tematisk afsnit til den enkelte verbalbetydning (dvs. beskriver betydninger anskuet fra forskellige vinkler), udgjorde datamaterialet et velegnet grundlag for tildeling af mulige frame-værdier fra den internationale standard Berkeley FrameNet til de enkelte ordbetydninger, igen med bevarelse af id-numrene fra DDO.

Ud fra Begrebsordbogens kapitler kunne overordnede kategorier, fx alle verber og verbalsubstantiver der omhandler kommunikation, behandles i samme ombæring så inventaret af mulige frames blev overskueligt. Leksikonet er tænkt som leksikografisk hjælp til semantisk ramme- og rolleopmærkning af danske tekster, men giver i sig selv værdifuld formaliseret semantisk information om de enkelte ordbetydninger (Nimb et al., 2017; Nimb, 2018).
2.4 Det Danske Sentimentleksikon

Inspireret af SenSALDOs sentimentleksikon (Rouces et al., 2018a; Rouces et al., 2018b) blev der også udarbejdet et sentimentleksikon ud fra Begrebsordbogens afsnitsopdeling. I afsnit hvor ordene så ud til at have inhærent polaritet baseret på afsnitsbetegnelsen, blev disse automatisk opmærket med hhv. positiv og/eller negativ polaritet. Ud af Begrebsordbogens 888 afsnit drejede det sig om 122 negative afsnit, fx "Tristhed", 80 positive afsnit, fx "Beundre", samt 12 afsnit hvor polariteten var uklar, fx "Omdømme". De automatisk opmærkede ord blev derpå manuelt valideret. Ord uden polaritet blev tildelt værdien 0. 400 ord blev opmærket af to leksikografer for at måle annotørenigheden, som var 0,83 (Cohens kappa). Graden af polaritet, +3 til -3, blev tilføjet ved at sammenligne dels med et tidligere dansk sentimentleksikon, AFINN, dels med ordets synonymer og nærsynonymer fra Begrebsordbogen idet den semantiske rækkefølge var bevaret i udtrækket. På lemmaniveau blev ord med divergerende polaritet derefter nærmere undersøgt, og det blev besluttet om en betydning eller et helt lemma skulle slettes. De neutrale ord blev fjernet fra listen. Resultatet er et sentimentleksikon med knap 14.000 polaritetsbærende lemmaer, heraf 62 % negative og 38 % positive (Nimb et al., 2022). Denne fordeling ligger i øvrigt tæt op ad fordelingen i det svenske SenSALDO-leksikon. Sentimentleksikonet er større end de allerede eksisterende sentimentleksikoner for dansk (for en evaluering se Schneidermann & Pedersen (2022)).

2.5 COR – Centralt OrdRegister for dansk

I COR-projektet udvælges semantiske kerneoplysninger fra DanNet, FrameNet-leksikonnet og sentimentleksikonnet. Det betyder i korte træk at ressourcen har information om antal betydninger, ontologisk type, nærmeste danske overbegreb (fra DanNet), semantisk verbalramme (fra FrameNet) samt konnotation (positiv/negativ) fra Sentimentordbogen.
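For readers unfamiliar with the agreement measure mentioned in section 2.4, Cohen's kappa over two annotators' polarity labels can be computed as in the following minimal sketch. The labels below are made-up illustrations, not data from the Danish Sentiment Lexicon:

```python
from sklearn.metrics import cohen_kappa_score

# Made-up polarity judgements (-1 = negative, 0 = neutral, 1 = positive)
# from two annotators over the same ten words; illustration only.
annotator_a = [1, 1, 0, -1, -1, 0, 1, -1, 0, 1]
annotator_b = [1, 1, 0, -1, 0, 0, 1, -1, 0, -1]

# Kappa corrects the observed agreement (here 8/10) for the agreement
# expected by chance, given each annotator's label distribution.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

For these toy labels the observed agreement of 0.8 is reduced to a kappa of about 0.70, which incidentally is close to the 0.83 reported for the 400 doubly annotated words.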
Udgangspunktet er denne gang 60.000 lemmaer i den danske retskrivningsordbog (RO) der nummerindekseres og forsynes med formaliserede morfologiske oplysninger, og som efterfølgende lanceres som en frit tilgængelig standardressource, administreret af Dansk Sprognævn (se også Nimb et al. (i trykken)). Tanken er at fremtidige sprogteknologiske ordbøger kan koble sig på vha. id-numrene; dermed sikres en mere effektiv deling af danske leksikalske data. For en stor del af RO-ordforrådet tilkobles et betydningsinventar udarbejdet på basis af oplysningerne i de semantiske ressourcer nævnt ovenfor: DDO, DanNet, Begrebsordbogen, FrameNet-leksikonnet og sentimentleksikonnet. COR-ordbogen er semasiologisk i sin opbygning: Alle DDO-betydninger af et lemma tages i betragtning, ikke kun dem der fx er beskrevet i DanNet. Kun de væsentligste betydninger medtages imidlertid: Sjældne, gammeldags eller faglige DDO-betydninger af lemmaet udelades i COR, og samtidig slås nært beslægtede betydninger i DDO sammen så der opnås en mere grovkornet betydningsinddeling der egner sig bedre til sprogteknologisk anvendelse. I udviklingen af ordbogen anvendes en række automatiske metoder der sikrer ensartet behandling på tværs af ordforrådet, og reduceringen af antal betydninger udføres automatisk for en del af de polyseme ord på baggrund af håndopmærkede data (Pedersen et al., 2022). Ordbogen, der lanceres i december 2023, vil omfatte ca. 30.000 lemmaer; ca. 11.000 af disse er udvalgt som værende særligt centrale i dansk ud fra enten deres kobling til centrale begreber udvalgt for engelsk4 eller ud fra deres egenskab af nøgleord i Begrebsordbogen.

3 Afrundende bemærkninger

Det er særdeles omkostningstungt at udvikle leksikalsk-semantiske ressourcer til sprogteknologi. Særligt i de mindre sprogsamfund har vi slet ikke råd til ikke at få fuldt udbytte af disse i den teknologi der udvikles. Som vi har påpeget her, indeholder de semantiske ressourcer vigtig viden om bl.a.
kultur og samfund, som ikke nødvendigvis fremgår af sprogmodeller som er bygget ud fra ren tekst. Hvis vi skal dybere ned i sprogforståelsen i fremtidig sprogteknologi, er det derfor nødvendigt at inkludere den viden som ressourcerne rummer.

Problemer med kuratering, opskalering og/eller manglende vedligehold har tidligere medført at nogle NLP-ressourcer udviklet i mindre projekter med kortvarig finansiering ikke altid er blevet fuldt udnyttet i et større perspektiv. De har simpelthen været for svære at integrere og anvende og har været for løsrevne fra andre, mere almene ordbøger, som typisk (og forhåbentlig) har kuratering, vedligeholdelse og opdatering med som en del af deres udvikling og finansiering. Derfor har vi i denne artikel også argumenteret for at leksikalsk-semantiske ressourcer til sprogteknologi bedst og sikrest udvikles inden for rammerne af en leksikalsk infrastruktur og i samarbejde med de klassiske ordbøger som et sprogsamfund investerer midler i i forvejen. I den forbindelse har vi også fremhævet ELEXIS-projektet (elex.is) som et vigtigt netværk der arbejder henimod at muliggøre sådan en infrastruktur både i et nationalt og et internationalt perspektiv.

4 Der er taget udgangspunkt i det engelske core wordnet som udgør 5000 centrale begreber på engelsk, og som kan downloades her: https://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt.

Referencer

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, & Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Simon Krek, Iztok Kosem, John P McCrae, Roberto Navigli, Bolette S Pedersen, Carole Tiberius, & Tanja Wissik. 2018. European Lexicographic Infrastructure (ELEXIS). In Proceedings of the XVIII EURALEX International Congress on Lexicography in Global Contexts, pages 881–892.
Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, & Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 166–172, Florence, Italy. Association for Computational Linguistics.

Sanni Nimb, Anna Braasch, Sussi Olsen, Bolette Sandford Pedersen, & Anders Søgaard. 2017. From Thesaurus to FrameNet. In Electronic Lexicography in the 21st century: Proceedings of eLex 2017 conference, pages 1–22.

Sanni Nimb, Sussi Olsen, Bolette S Pedersen, & Thomas Troelsgård. 2022. A Thesaurus-Based Sentiment Lexicon for Danish – The Danish Sentiment Lexicon. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), Marseille, pages 2826–2832.

Sanni Nimb, Bolette Sandford Pedersen, Nicolai Hartvig Hau Sørensen, I. Flörke, Sussi Olsen, & T. Troelsgård. (i trykken). COR-S – den semantiske del af Det Centrale OrdRegister (COR). LexicoNordica 29, Nordisk Forening for Leksikografi.

Sanni Nimb. 2018. The Danish FrameNet Lexicon: method and lexical coverage. In Proceedings of the International FrameNet Workshop at LREC, Miyazaki, pages 51–55.

Bolette Sandford Pedersen, Sanni Nimb, Jørg Asmussen, Nicolai Hartvig Sørensen, Lars Trap-Jensen, & Henrik Lorentzen. 2009. DanNet: the challenge of compiling a WordNet for Danish by reusing a monolingual dictionary. Language Resources and Evaluation, 43(3):269–299.

Bolette Sandford Pedersen, Sanni Nimb, & Sussi Olsen. 2021. Dansk betydningsinventar i et datalingvistisk perspektiv. Danske Studier 2021, Universitets-Jubilæets danske Samfund 2021, pages 72–106.

Bolette S. Pedersen, Nicolai C.H. Sørensen, Sanni Nimb, S. Flörke, Sussi Olsen, & Thomas Troelsgård. 2022. Compiling a Suitable Level of Sense Granularity in a Lexicon for AI Purposes: The Open-Source COR Lexicon. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), Marseille, pages 51–60.
Jacobo Rouces, Lars Borin, Nina Tahmasebi, & Stian Rødven Eide. 2018a. Defining a Gold Standard for a Swedish Sentiment Lexicon: Towards Higher-Yield Text Mining in the Digital Humanities. In CEUR Workshop Proceedings vol. 2084, Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7–9, 2018, pages 219–227.

Jacobo Rouces, Lars Borin, Nina Tahmasebi, & Stian Rødven Eide. 2018b. SenSALDO: Creating a sentiment lexicon for Swedish. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, ELRA, pages 4192–4198.

Nina Skovgaard Schneidermann & Bolette Sandford Pedersen. 2022. Evaluating a New Danish Sentiment Resource: the Danish Sentiment Lexicon, DSL. In Proceedings for SALLD-2 – 2nd Workshop on Sentiment Analysis and Linguistic Linked Data, pages 19–25. European Language Resources Association.

Agnes Sólmundsdóttir, Dagbjört Guðmundsdóttir, Lilja Björk Stefánsdóttir, & Anton Karl Ingason. 2022. Mean Machine Translations: On Gender Bias in Icelandic Machine Translations.

Benjamin D Van Durme. 2009. Extracting Implicit Knowledge from Text. PhD thesis, University of Rochester.

Den ena texten och den andra: Visualisering av textpar med hjälp av ordmoln

Maria Skeppstedt1, Gunnar Eriksson2, Magnus Ahltorp2 och Rickard Domeij2
1 Centrum för digital humaniora Uppsala, Institutionen för ABM, Uppsala universitet, maria.skeppstedt@abm.uu.se
2 Språkrådet, Institutet för språk och folkminnen

Abstract

Word clouds are commonly used for visualising the content of text collections. We here propose a slight update of the standard word cloud that also visualises similarities between pairs of texts. We apply the method to a text collection consisting of 39 song lyrics and visualise similarities between four text pairs.
1 Inledning

Ordmoln, det som på engelska kallas "tag clouds" eller "word clouds", är ett väldigt populärt sätt att visualisera innehållet i en text, eller de taggar som är associerade med en text. Den enklaste formen av ordmoln visar orden med en större font ju mer frekvent förekommande de är i texten/bland ordtaggarna, och arrangerar sedan orden i ett moln, exempelvis i alfabetisk ordning (Viégas & Wattenberg, 2008). Det finns färdiga webbtjänster för att generera ordmoln, vilket skulle kunna vara en anledning till deras popularitet. En annan anledning skulle kunna vara att ordmoln är en väldigt enkel och lättförståelig visualisering.

Det har dock riktats kritik mot de klassiska ordmolnen, exempelvis för att betraktaren lätt kan tolka in en betydelse i hur orden är placerade, även när en sådan innebörd saknas (Barth et al., 2014). Att använda fontstorlek som indikation på ett ords signifikans kan också vara problematiskt, i och med att långa ord då lätt kan uppfattas som mer viktiga, i och med att deras längd gör att de tar upp mer plats i ordmolnet (Viégas & Wattenberg, 2008).

Det finns många olika varianter av standardversionen av ordmolnen. Andra kriterier än ordfrekvens, såsom TF-IDF-måttet (Barth et al., 2014), kan användas. Det finns även olika metoder för att placera orden i molnet så att placeringen får en innebörd, exempelvis Context-preserving word cloud visualisation (Barth et al., 2014) och t-SNE, d.v.s. "t-distributed stochastic neighbour embedding" (Schubert et al., 2017). Det finns också flera utökningar av ordmolnens syfte, d.v.s. ordmoln som, förutom att visualisera innehållet i en text, även har till syfte att visualisera andra aspekter. Exempelvis visar temporala ordmoln hur ordens frekvens varierat över tid (Jatowt et al., 2021). Vi hade också för avsikt att utöka ordmolnens syfte.
Förutom att använda ordmoln för att (i) visualisera innehållet i dokumenten i den textsamling vi undersökte, ville vi (ii) samtidigt även kunna visualisera textlikhet mellan olika dokument. Vi har använt ett verktyg för att söka bland tidigare textvisualiseringsinitiativ (Kucher & Kerren, 2015)1, men inte hittat någon färdig metod för visualisering som fungerar för exakt det vi vill visa. Vi kommer därför här att designa en enkel metod för att visualisera båda dessa aspekter.

1 https://textvis.lnu.se

Maria Skeppstedt, Gunnar Eriksson, Magnus Ahltorp and Rickard Domeij. 2022. Den ena texten och den andra: Visualisering av textpar med hjälp av ordmoln. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 121–125. Available under CC BY 4.0

Figur 1: Fyra par av texter visas i figuren. Par 1 består av gruppens mest kända låt och den text som var mest lik denna låttext. Par 2 består av de två texter som var allra mest lika enligt det likhetskriterium vi använde, par 3 av de näst mest lika texterna, o.s.v. (Titeln på låtarna står med liten stil vid sidan av graferna. Så för den som vill gissa vilken låttext som visualiseras, rekommenderar vi att inte kika på titeln i förväg.)

2 Textsamlingen

Vi samlade in texter från en svensk musikgrupp som främst var verksam under sent 60-tal och första halvan av 70-talet. Vi hittade texterna på en webbsida med låttexter från gruppen2. Webbsidan innehöll även låtar skrivna av bandets medlemmar i andra kontexter än tillsammans med gruppen. Vi allokerade cirka tre timmar för att extrahera låttexter från webbsidan, och formatera dem i ett enhetligt textformat. Vi gjorde en begränsad ansträngning för att enbart ta med låtar skrivna för gruppen, men även en viss mängd låtar skrivna av bandmedlemmarna i andra sammanhang kan ha tagits med i vår textmängd. Den allokerade tiden räckte till att skapa en textmängd bestående av 39 låttexter.
3 Metod

Metoden bestod av två delar. Dels skapade vi en dokumentvektor för varje låttext, som vi använde som ett kriterium för att välja ut fyra par av låttexter, och dels gjorde vi en visualisering av orden i dessa åtta låttexter.

2 Webbsidan hade adressen .se för en av bandmedlemmarna.

Figur 2: De två nedre textparen från figur 1, men där orden har behållit originalplaceringen genererad av t-SNE-algoritmen. (Titeln på låtarna står med liten stil vid sidan av graferna. Så för den som vill gissa vilken låttext som visualiseras, rekommenderar vi att inte kika på titeln i förväg.)

3.1 Dokumentvektorer

Skapandet av TF-IDF-vektorer för dokumenten i vår textsamling bestod av följande steg: (i) generera statistik över svenska dokumentfrekvenser utifrån en extern textmängd, (ii) identifiera vanliga begreppskluster i vår textmängd, och byta ut orden i dessa kluster mot en sträng som representerar klustret, (iii) skapa en TF-IDF-vektor för varje dokument, (iv) skapa den slutgiltiga dokumentvektorn genom att utöka TF-IDF-vektorn med word2vec-vektorer som representerar dokumentet.

(i) För att ha lite mer bakgrundsdata för svenska dokumentfrekvenser använde vi en stor extern korpus för IDF-uträkningen, närmare bestämt stycken från 1000 SOU:er3, där varje stycke behandlades som ett dokument. (Det var en väldigt slumpmässigt vald textmängd, men en mängd som borde ge tillräckligt bra information om typiska svenska orddokumentfrekvenser för vårt syfte.)

(ii) Innan vi konstruerade TF-IDF-vektorerna, gjorde vi en pre-processning av texterna där vi bytte ut vissa ord i texten mot en sträng som representerar ett synonym- eller begreppskluster där ordet ingår. Detta gjordes både för bakgrundstexten och för de dokument vi ville visualisera. Exempelvis bestod ett av våra synonymkluster av orden "arbeta/jobba/verka". Det innebar att varje gång något av dessa tre ord förekom i en text bytte vi ut ordet mot strängen "arbeta/jobba/verka".
Dessa tre ord blev då behandlade som ett och samma ord när TF-IDF-vektorer skapades. Vi skapade synonymklustren genom att köra en dbscan-klustring (Ester et al., 1996) på word2vec-vektorer för orden i vår textsamling, och gick därefter manuellt igenom och rättade alla automatiskt skapade kluster. Totalt skapade vi 160 kluster. Ordvektorerna hittade vi genom att använda ett förtränat word2vec-ordrum4 med 100 element långa vektorer.

(iii) Efter pre-processningen skapade vi TF-IDF-vektorer för alla dokument i vår textmängd. Vi gjorde inte någon kontroll och/eller borttagande av dubblerad text, trots att detta är ett vanligt fenomen i låttexter, exempelvis eftersom det ofta finns refränger. Anledningen var att om ord ofta återkom, som till exempel i refränger, så skulle detta också faktiskt avspeglas i en högre ordfrekvens för dessa ord. För att inte göra sådana ord helt dominerande använde vi emellertid inte standard-TF utan dess logaritmiska värde.

(iv) Till den vanliga TF-IDF-vektorn konkatenerade vi sedan kombinerade ordvektorer. Detta för att fånga likhet mellan dokument utan överlappande ord eller begreppskluster. Samma word2vec-ordrum som ovan användes. Vi gav på flera sätt större vikt till ord med ett högt TF-IDF även för de kombinerade ordvektorerna. För det första konstruerade vi tre olika summerade ordvektorer, som vi konkatenerade till den vanliga TF-IDF-vektorn: två vektorer bestående av summan av vektorerna för orden med de tre respektive tio högsta TF-IDF-värdena, och en vektor bestående av summan för alla ord i dokumentet. För det andra multiplicerade vi alla summerade vektorer med ordets TF-IDF-värde innan vi normaliserade vektorerna. Alla beskrivna experiment utfördes med hjälp av maskininlärningsbiblioteket scikit-learn (Pedregosa et al., 2011).

3 https://github.com/UppsalaNLP/SOU-corpus
4 http://vectors.nlpl.eu/repository/, Word2Vec Continuous Skipgram tränad på Swedish CoNLL17 corpus
3.2 Ordvisualisering

För varje par av dokument beräknades det euklidiska avståndet mellan dokumentens vektorer, och de tre par som var mest lika valdes ut. Paret bestående av gruppens mest kända låt och dess mest lika låttext i dokumentmängden valdes också ut. För dessa fyra par skapade vi sedan en enkel visualisering i Pyplot. Vi skapade en bild vardera för de två texterna i paret, och placerade den ena till vänster och den andra till höger. I bilden skapade vi ett ordmoln med de 25 ord som hade högst TF-IDF-värde. Ju högre TF-IDF-värde, med desto starkare färg skrevs ordet, för ordmolnet till vänster i blå nyanser och för ordmolnet till höger i gröna nyanser.

Kritiken mot att använda fontstorlek som indikation på ett ords signifikans till trots, visade vi ett ord med en större fontstorlek ju högre dess TF-IDF-värde var. Vi motiverar det med att ordmoln med varierande fontstorlek har blivit någon form av standard, på grund av dess popularitet. Därmed finns det en poäng i att på något sätt hålla sig till den standarden. Vi gjorde dock endast en liten ökning av fontstorleken med ökande TF-IDF-värde. Dessutom gjorde vi en liten, generell fontminskning av ord beroende på deras längd, för att minska risken att långa ord skulle kunna uppfattas som viktigare än korta ord. För att ge betraktaren en mer objektiv indikation på ett ords TF-IDF-värde än ordets färg och fontstorlek lade vi även dit en understrykning av ord, där längden på det streck med vilket ordet är understruket är direkt proportionell mot ordets (logaritmiska TF)-IDF-värde.

För att lägga en verklig betydelse i ordens placering i molnet, skapade vi en utplacering som placerade ord med liknande betydelse nära varandra. Vi bestämde ordens placering genom att köra t-SNE-algoritmen (van der Maaten & Hinton, 2008) på word2vec-vektorerna som hörde till orden i texten.
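A rough sketch of this placement-and-styling step is given below, with random stand-ins for the word vectors and TF-IDF weights (all names are illustrative, and the actual matplotlib rendering is omitted):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical top-25 words with random stand-ins for their word2vec
# vectors and TF-IDF weights; a real run would use the values computed
# in section 3.1.
rng = np.random.default_rng(42)
words = [f"ord{i}" for i in range(25)]
wordvecs = rng.normal(size=(25, 16))
weights = np.sort(rng.uniform(0.1, 1.0, size=25))[::-1]

# Similar words end up near each other: t-SNE projects the word
# vectors down to a 2-d position for every word in the cloud.
coords = TSNE(n_components=2, perplexity=5.0, init="pca",
              random_state=0).fit_transform(wordvecs)

# Font size grows only slightly with TF-IDF and shrinks a little with
# word length, so long words are not over-emphasised.
sizes = (10.0 + 4.0 * weights / weights.max()
         - 0.2 * np.array([len(w) for w in words]))

# Underline length is directly proportional to the TF-IDF weight,
# giving a more objective cue than colour and font size alone.
underline = 30.0 * weights
```

Each `(word, coords[i], sizes[i], underline[i])` tuple is then what a plotting routine would draw.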
Att placera ut orden exakt på den plats som bestämdes av t-SNE-algoritmen skulle emellertid göra att de överlappade varandra och blev svårlästa. Vi provade att använda ett standardbibliotek, adjustText, som placerar om text för att undvika detta. Dock ledde det till att orden då istället förlorade den struktur de givits av t-SNE-algoritmen, trots att vi experimenterade med olika parametrar till adjustText. Vi implementerade därför istället en egen enkel algoritm för att flytta om orden. Algoritmen placerar först ut det ord som har högst TF-IDF-värde, och fortsätter sedan nedåt med fallande TF-IDF-värde. Om ett nytt ord som ska placeras ut överlappar ett ord som redan är utplacerat flyttas det nya ordet uppåt i grafen tills det nya ordet inte längre överlappar med ett tidigare utplacerat ord. Detta gör att de ord som är viktigast för texten oftare placeras i enlighet med dess innebörd, medan mindre viktiga ord kan flyttas runt. Ord som ingår i ett synonymkluster indikeras med ett plus efter ordet. Endast det första ordet i ett kluster visas.

4 Resultat

De fyra par av texter som vi visualiserade visas i figur 1, och i figur 2 visas hur visualiseringen skulle ha sett ut om vi inte flyttat orden för att undvika överlapp. Bilderna är ett försök att både skapa en visualisering av texterna där ordens placering har en innebörd, och, genom att placera texterna i par, även skapa en visualisering där två texter lätt kan jämföras. Vi vill att visualiseringen snabbt ska kunna svara på frågan: Vad innehåller den ena texten, och vad innehåller den andra? Vi lämnar utvärderingen av huruvida visualiseringen ger en bra översikt över textparen till läsarna, särskilt till läsare som är väl förtrogna med musikgruppen i fråga. Går det exempelvis att gissa vilken gruppen är? Går det att gissa vilka låttexter som visualiseras i bilderna?
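The greedy overlap-resolution step described in section 3.2 (place words in descending TF-IDF order, and nudge a newly placed word upwards until it no longer overlaps any earlier word) can be sketched as follows. Bounding boxes are represented as (x, y, width, height) tuples; this is a sketch under those assumptions, not the authors' actual implementation:

```python
def overlaps(a, b):
    # Axis-aligned boxes (x, y, w, h) overlap iff they intersect in
    # both dimensions; touching edges do not count as overlap.
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_words(items):
    """items: list of (word, x, y, w, h), sorted by descending TF-IDF."""
    placed = []
    for word, x, y, w, h in items:
        box = (x, y, w, h)
        # Nudge the new word upwards until it is free of all earlier ones,
        # so the most important words keep their t-SNE positions.
        while any(overlaps(box, b) for _, b in placed):
            box = (box[0], box[1] + 1.0, box[2], box[3])
        placed.append((word, box))
    return placed

# Three deliberately overlapping boxes: "b" and "c" get nudged upwards.
demo = place_words([("a", 0.0, 0.0, 2.0, 1.0),
                    ("b", 0.5, 0.0, 2.0, 1.0),
                    ("c", 0.0, 0.5, 2.0, 1.0)])
```

Because the loop re-checks against every already-placed box after each nudge, a word can hop over several earlier words before settling, but the highest-weighted word is never moved.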
Tack

Vi vill tacka Nationella språkbanken (Vetenskapsrådet, 2017-00626), HUMINFRA (Vetenskapsrådet, 2021-00176) samt Dagstuhl-seminariet "Visual Text Analytics".

Referenser

Lukas Barth, Stephen G. Kobourov, & Sergey Pupyrev. 2014. Experimental comparison of semantic word clouds. In Joachim Gudmundsson & Jyrki Katajainen, editors, Experimental Algorithms, pages 247–258, Cham. Springer International Publishing.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, & Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231, Palo Alto, California. AAAI Press.

Adam Jatowt, Nina Tahmasebi, & Lars Borin. 2021. Computational approaches to lexical semantic change: Visualization systems and novel applications. In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, & Simon Hengchen, editors, Computational approaches to semantic change. Language Science Press.

Kostiantyn Kucher & Andreas Kerren. 2015. Text visualization techniques: Taxonomy, visual survey, and community insights. In 2015 IEEE Pacific Visualization Symposium (PacificVis), pages 117–121.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, & Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Erich Schubert, Andreas Spitz, Michael Weiler, Johanna Geiß, & Michael Gertz. 2017. Semantic word clouds with background corpus normalization and t-distributed stochastic neighbor embedding. ArXiv, abs/1708.03569.

Laurens van der Maaten & Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Fernanda B. Viégas & Martin Wattenberg. 2008.
Timelines: tag clouds and the case for vernacular visualization. Interactions, 15(4):49–52.

Datorlingvistikens vagga i Uppsala

Anna Sågvall Hein, Uppsala universitet, anna@lingfil.uu.se

Abstract

Computational Linguistics at Uppsala University has a long history. It goes back to the mid-1960s, when the University Computer Center, UDAC, was established. It was to provide a powerful computer resource for research at the university and encourage the use of computers in new fields such as the Humanities. With this aim in mind, the director of UDAC approached the language departments. This initiative led to a contact with the Slavic department and a joint decision to support an ongoing thesis work in Slavic languages, directed towards computational linguistics. On completion of the thesis, the researcher was employed at UDAC with the mission to do research in Computational Linguistics and support Natural Language Processing. This was the embryo of the Natural Language Processing group at UDAC and its successor, the Center for Computational Linguistics at the Faculty of Humanities. We describe activities in this early period with a focus on corpus linguistics, process morphology and program development, where pioneering contributions were made by Lars Borin. We also emphasise the importance of the consistent and close cooperation between the Faculty of Humanities and UDAC, a prerequisite for the successful development of Computational Linguistics in Uppsala and the establishment of a chair in 1987.

1 Inledning

Datorlingvistiken har en lång historia i Uppsala. Den går tillbaka till 1960-talet, närmare bestämt till 1965, då Uppsala universitetsdatacentral, UDAC, inrättades. Universitetsdatacentralerna skulle tillgodose kända behov av datorkraft för forskning i teknik och naturvetenskap men också verka för att initiera användning av datorer inom andra områden. Den nytillträdde chefen för UDAC, docent Werner Schneider, tog sig omedelbart an uppgiften.
He turned to the Faculty of Humanities and invited its departments to seminars where he described the possibilities the new technology could offer. Ongoing and planned research was presented, and the departments were given the opportunity to propose pilot projects. Among the language departments, it was the Slavic department, represented by docent Carin Davidsson, that showed the greatest interest. Two pilot projects were specified. One concerned reverse-alphabetical sorting of Czech dictionary material; the other concerned the development of computer-based tools for linguistic purposes. A licentiate project in Slavic languages with a computational-linguistic orientation was already under way, pursued by a student on a scholarship in Leningrad. It was jointly decided to support this project with programming help and whatever else was needed. The investment paid off, resulting in a thesis on automatic analysis of the Russian verb (Sågvall, 1968) and in the licentiate being employed at UDAC to conduct research in computational linguistics and to support linguistic data processing. This became the embryo of the Language Group at UDAC (Natural Language Processing Group) and its successor, the Center for Computational Linguistics at the Faculty of Humanities. In 1987 the government decided to establish a professorship in computational linguistics.

One of the pioneers was Lars Borin. With his broad expertise in both linguistics and computer science, he contributed invaluably to the development of the research environment. His efforts mainly concerned Russian corpus linguistics, process morphology and program development, areas of interest that receive special attention in the account below.

Anna Sågvall Hein. 2022. Datorlingvistikens vagga i Uppsala. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 127–132. Available under CC BY 4.0

2 The Language Group (Natural Language Processing Group, NALP)

Interest in linguistic data processing turned out to be great within the faculty.
Among the languages processed in the initial phase were Swedish, English, Finnish, Czech, Estonian, German, French, Persian and Russian. Sanskrit and Hungarian were added later. In collaboration with linguists from the various departments, members of the Language Group1 built up linguistic resources for both research and teaching (NALP, 1978). They were also responsible for the data processing in carrying out the investigations (Sågvall et al., 1975). For English, the Brown Corpus2 was acquired. It came in a primitive punch-card format but was modernized so that it became usable for the Swedish researchers. The modernization mainly involved recoding the character representation and dividing the corpus into sentences, so that users could work with sentences instead of punch-card images. When the activities began in 1970, no ready-made programs for linguistic data processing were available; they had to be written by members of the Language Group as the need arose (Sågvall et al., 1974a).

2.1 Infrastructure

In 1970 the infrastructure looked completely different from today's. All runs were made on a mainframe (IBM 370/155), which stood in a machine hall that only dedicated operators could enter. Both programs and data were input via punch cards, a task performed by keypunch operators. The punch cards represented alphanumeric characters according to an IBM standard designed for the Western European languages. It excluded languages such as Russian, Polish, Hungarian and Czech, which therefore required transliteration. For Russian, a convention was established in cooperation with the keypunch operator who was to enter the material. Care was taken to find correspondences that, by virtue of graphic similarity, could be memorized by someone who did not know the Cyrillic alphabet. The results of the runs also came out on punch cards, and the punch-card images could be printed on a line printer. This was not an appealing format for linguistic data, especially not for data in transliterated form.
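A transliteration convention of the kind described, built on graphic similarity between Cyrillic and Latin letters, can be sketched as follows. The mapping below is invented for illustration; the actual UDAC keypunch convention is not documented here:

```python
# Toy Cyrillic-to-Latin transliteration table exploiting graphic look-alikes
# (invented mapping; NOT the actual UDAC keypunch convention).
TRANSLIT = {
    "а": "a", "в": "b", "е": "e", "к": "k", "м": "m",
    "н": "h", "о": "o", "р": "p", "с": "c", "т": "t",  # н looks like H, с like C
}

def transliterate(text):
    """Map each Cyrillic character; leave unknown characters untouched."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())
```

The point of such a table is that a keypunch operator with no Cyrillic can key the material from sight, at the cost of a lossy, one-way encoding.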
Output in the special alphabets was a particular problem. In some cases it was solved with a printer-plotter, which could be programmed to draw arbitrary letters; this was done for the Greek, Cyrillic and French (with accents) alphabets. The plotter could also switch between fonts on one and the same line, something needed for printing Russian–Swedish word lists in a pedagogical project. Stress accents for the Russian words were printed as well. Another solution was to have the computer produce binary cards that could be read by a typewriter with an interchangeable typehead, connected to a minicomputer. This method was tested and used for producing the Czech lexicon for printing.

2.2 Russian morphology

At first, Russian was at the center of attention. The automatic analysis of the Russian verb was followed by a model for the analysis of the entire Russian inflectional morphology (Sågvall, 1973). The system, AUTLEX, assigned Russian words lexical descriptions that comprised, among other things, part of speech, lemma and inflectional form. The part-of-speech classification was based strictly on inflectional categories and thus differed somewhat from the traditional classification. AUTLEX used a stem lexicon; words missing from the lexicon received no analysis, while homographic words received several analyses, one for each homograph component. AUTLEX was used extensively for the annotation of Russian corpus data. The system operated on words in isolation and thus could not distinguish context-dependent homographs. This required separate homograph disambiguation, which at first was performed entirely by hand; gradually, methods were introduced to facilitate the work. The first larger application was a pedagogical project, Målanalys i ryska ('Goal analysis in Russian') (Sågvall et al., 1976), carried out with a grant from the Office of the University Chancellor. With the help of AUTLEX, a selection of Russian literary texts intended for text reading at the university's A level was annotated.
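The lookup behavior described above, with a lexicon that yields no analysis for unknown words and one analysis per homograph component, can be sketched like this. The entries and feature labels are invented toy data, not AUTLEX's actual lexicon or output format:

```python
# Toy sketch of AUTLEX-style lexicon lookup (invented data and labels,
# not the real system's lexicon or tagset).
LEXICON = {
    # surface form -> list of analyses; homographs get several
    "стали": [
        {"pos": "V", "lemma": "стать", "form": "past.pl"},   # 'became'
        {"pos": "N", "lemma": "сталь", "form": "gen.sg"},    # 'of steel'
    ],
    "дома": [
        {"pos": "N", "lemma": "дом", "form": "gen.sg"},
        {"pos": "Adv", "lemma": "дома", "form": "invar"},
    ],
}

def analyze(word):
    """Return all analyses of a word in isolation; [] if not in the lexicon."""
    return LEXICON.get(word, [])
```

Since the word is analyzed in isolation, both analyses of a homograph are returned, and choosing between them is left to a separate disambiguation step, just as described for AUTLEX.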
Through statistical calculations, the annotated texts could be ordered by increasing difficulty according to vocabulary richness. Furthermore, the lexical progression when reading the texts in the established order could be defined. This formed the basis for the development of a teaching resource for Russian text reading, Text och Ord (Sågvall et al., 1974b).3

Another larger application concerned text attribution.4 It dealt with the question of whether the Nobel laureate Sholokhov had written all of Quiet Flows the Don (Stilla flyter Don). It had been claimed that part of the work was written by a White Russian officer, Kryukov. The study was based on three corpora of 50,000 words each: one contained text securely attributed to Sholokhov, one text securely attributed to Kryukov, and a third the disputed part of the work. The corpora were entered into the computer via punch cards and subjected to various quantitative calculations: the number of syllables, the average length per 1,000 running words, the number of distinct word forms per 1,000 running words, and frequency and frequency distribution. Corresponding calculations were made at the lexeme level. On the syntactic side, sentence openings in the different corpora were compared by examining sequences of lexemes and parts of speech. A synthesis of the various statistical calculations showed that Kryukov could be excluded as a possible author, but not Sholokhov (Kjetsaa et al., 1984).

The linguistic data processing was carried out by the Language Group and later the Center for Computational Linguistics. It comprised the recognition of linguistic units such as syllables, word forms, lexemes, parts of speech, sentences and punctuation marks, and calculations over these units. Lexemes and parts of speech were recognized through analysis with AUTLEX and homograph disambiguation, the latter facilitated by a newly developed interactive program.

1 The membership of the Language Group varied over the years. In 1978 it consisted of Annika Mattsson, Uwe Hein, Tone Tingsgård and Erling Wande, led by Anna Sågvall Hein.
2 A Standard Sample of Present-day Edited American English. 1964. Brown University, Providence, R.I., USA.
The analysis was preceded by a corpus adaptation of the lexicon.

2.3 Parsing

Research on general methods for parsing (linguistic analysis) was initiated. After trials with Augmented Transition Network Grammars (Woods, 1973), the research was directed towards network-based analysis models and procedural formalisms. A first concrete result was the implementation of an experimental version of a chart parser, based on lecture notes about a linguistic processor presented at an international summer school in Pisa (Kay, 1974). The processor simulated a non-deterministic machine, and alternative analyses were stored in a central data structure called a chart. The chart was a directed graph whose edges carried linguistic information, which made it possible to handle several different processes within one and the same framework. The linguistic rules were stored in so-called waiting lists on the edges.

The Uppsala parser (Sågvall et al., 1975) allowed experiments with morphological and syntactic analysis, including morphographemic rewriting and lexicon lookup. By means of a top-level function one could choose which process to carry out. The parser was tested on a small Swedish grammar. The implementation was based on some 80 LISP functions provided by Kay, and it constituted a valuable environment for building up competence.

Later, Kay (1977) presented a generalization of the principles of processing in a chart parser. This was achieved through a development of the chart: not only linguistic information but also rules, passive and active, were represented by edges, and the interplay between active and passive edges drove the processing forward. Kay made the original software of the new parser available to the Language Group, and during a research stay in Uppsala he took part in implementing it in a local version of INTERLISP on UDAC's mainframe. This became the starting point for the development of the Uppsala Chart Processor, UCP.
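The interplay between active and passive edges that drove Kay's chart parser can be illustrated with a minimal bottom-up sketch in Python. This is a didactic toy, not the INTERLISP implementation, and the grammar and lexicon are invented:

```python
# Didactic active/passive chart parser (a sketch; not the INTERLISP UCP).
# A passive edge is a completed constituent (cat, i, j); an active edge is a
# partially matched rule (lhs, remaining symbols, i, j).
GRAMMAR = {               # toy grammar, invented for illustration
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "sees": "V"}

def _combine(lhs, rest, i, j):
    # Fundamental rule result: completed rule -> passive, else advanced active.
    return ("P", lhs, i, j) if len(rest) == 1 else ("A", lhs, tuple(rest[1:]), i, j)

def parse(words, goal="S"):
    passive, active = set(), set()
    agenda = [("P", LEXICON[w], i, i + 1) for i, w in enumerate(words)]
    while agenda:
        edge = agenda.pop()
        if edge[0] == "P":
            _, cat, i, j = edge
            if (cat, i, j) in passive:
                continue
            passive.add((cat, i, j))
            # bottom-up rule invocation: start rules whose first symbol matches
            for lhs, rhss in GRAMMAR.items():
                for rhs in rhss:
                    if rhs[0] == cat:
                        agenda.append(_combine(lhs, rhs, i, j))
            # fundamental rule against existing active edges ending at i
            for lhs, rest, a, b in list(active):
                if b == i and rest[0] == cat:
                    agenda.append(_combine(lhs, rest, a, j))
        else:
            _, lhs, rest, i, j = edge
            if (lhs, rest, i, j) in active:
                continue
            active.add((lhs, rest, i, j))
            # fundamental rule against existing passive edges starting at j
            for cat, a, b in list(passive):
                if a == j and rest[0] == cat:
                    agenda.append(_combine(lhs, rest, i, b))
    return (goal, 0, len(words)) in passive
```

An active edge waits at a position for a passive edge of the category it still needs; whenever either kind of edge is added, the fundamental rule checks it against edges of the other kind, so the order of processing does not matter.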
UCP handled phonological, morphological and syntactic analysis, including morphographemic rewriting and lexicon lookup. The graphemic rewriting and the morphological analysis were tried out on several languages, primarily Finnish (Sågvall Hein, 1977; Sågvall Hein, 1979) but also Russian and Serbo-Croatian. Sågvall Hein (1980) proposes a model for Finnish in which word recognition takes place in two parallel processes: phonological analysis into syllables and morphological analysis into morphs.

3 The Center for Computational Linguistics

On a proposal from the faculty, the Language Group was incorporated into the Faculty of Humanities as the Uppsala Center for Computational Linguistics, UCDL (Sågvall Hein, 1981). This happened in 1980. According to its official instruction, the activities were to concentrate on research in computational linguistics, on carrying out the language section's action program for linguistic data processing, and on promoting the historical-philosophical section's action program for data processing in text-based investigations. The Faculty of Humanities board and UDAC were given joint responsibility for the planning and follow-up of the activities.

In March 1981 an important step was taken when the language section board adopted a subject description for computational linguistics: Computational linguistics is directed at the simulation of linguistic behavior by computer. The subject also includes general methodology for investigations of language by means of computers (linguistic data processing). A docent position in computational linguistics was created.

3 Text och Ord comprised the texts, ordered by difficulty, accompanied by annotated word lists with translations, a basic vocabulary and an accumulated vocabulary. The whole publication was produced digitally and printed with a Benson printer-plotter. Despite the primitive format, the teaching resource was used at the university and at the Armed Forces Interpreter School (Arméns tolkskola) for some ten years.
4 The project was a collaboration between the Department of Slavic Languages at Uppsala University and the Slavisk-Baltisk Institutt at the University of Oslo.
Its holder was to lead the activities at the center. In addition, the center was given three part-time positions for programming and systems work and two language consultant positions, one in Scandinavian languages (Olle Hammermo) and one in Slavic languages (Lars Borin). Earlier activities within the Language Group were followed up at the Center for Computational Linguistics, with a certain focus on chart parsing and process morphology. Program development also received continued attention.

3.1 Chart parsing

On the parsing side, research was primarily directed at the further development of the Uppsala Chart Parser (Carlsson, 1982; Sågvall Hein, 1980; Sågvall Hein, 1987)5 and at building a parser for Swedish with an accompanying lexicon (Sågvall Hein, 1983; Sågvall Hein & Ahrenberg, 1985).

3.2 Process morphology

Research on general methods for process morphology was initiated, taking two-level morphology (Koskenniemi, 1983) as its starting point. While earlier models of morphological processing had been directed at either analysis or synthesis, two-level morphology enabled both analysis and synthesis. Morphological analysis is understood as a process in which a surface representation, a letter string, is assigned a morphological description (as in AUTLEX), whereas morphological synthesis goes the opposite way, from a morphological description to a surface representation.

Two-level morphology offers a general, direction-independent formalism together with a processing machinery. The machinery consists of finite-state automata, FSAs. An FSA is an abstract machine that at any given time is in one of a finite number of states. The automaton is defined by a list of possible states, an initial state and a final state, and by rules governing the transition from one state to another. The states can be seen as nodes in a transition network. In two-level morphology the nodes are made up of characters on two levels, the lexical level and the surface level. The transition from one node to another is governed by parallel rules.
These rules compare strings on the two levels and state whether or not they are permitted correspondences of each other. A description of Finnish morphology was included for trying out the system.

Two-level morphology spread widely internationally and was implemented early in Uppsala. Lars Borin began the exploration of two-level morphology on Polish inflectional morphology. In a course on linguistic data processing he started from a two-level description of Swedish (Blåberg, 1984), which he then developed further (Borin, 1985). Process morphology also came to be his primary research area, within which he wrote his doctoral dissertation (Borin, 1991).

3.3 Software development

The development of software for linguistic data processing continued with TEXTPACK (Rosén & Sjöberg, 1985) as its base. The data formats the programs worked with were largely standardized, but the access methods depended on the operating system (IBM 370/155) and on the access operators provided by the programming language (PL/I).

To get away from this machine dependence, Borin (1984) began the work of defining and implementing a general text-processing system, a text database system for linguists. It was to be user-friendly, general and portable. He made a deep dive into the Hungarian alphabet to demonstrate the problems one may encounter when alphabetizing linguistic data. In the proposed solution, a user-defined alphabet with a 16-bit representation of each character would be used. Digraphs, trigraphs and so on would count as letters in their own right and be treated as indivisible units. The commercially available sorting programs could not handle the problems of letter digraphs and longer units.

5 The development of UCP was carried out in the project Computer Simulation of the Text Comprehension Process, with support from Styrelsen för Teknisk Utveckling and UDAC.
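The idea of a user-defined alphabet in which digraphs count as indivisible letters can be sketched as follows, here with a small invented subset of Hungarian letters (the real 1984 system used a 16-bit character representation and was, of course, not written in Python):

```python
# Toy sketch: sorting with a user-defined alphabet where digraphs are atomic.
# Hungarian treats e.g. "sz" and "gy" as single letters ordered after "s" and "g".
ALPHABET = ["a", "e", "g", "gy", "m", "o", "s", "sz", "t", "u"]  # tiny subset
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
MAXLEN = max(len(letter) for letter in ALPHABET)

def to_letters(word):
    """Greedily segment a word into alphabet units, longest match first.

    Greedy longest-match is a simplification: real Hungarian has genuinely
    ambiguous letter sequences that need lexical knowledge to segment.
    """
    units, i = [], 0
    while i < len(word):
        for n in range(MAXLEN, 0, -1):
            if word[i:i + n] in RANK:
                units.append(word[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"{word[i]!r} not in alphabet")
    return units

def hu_sort(words):
    """Sort words by the ranks of their alphabet units, not their characters."""
    return sorted(words, key=lambda w: [RANK[u] for u in to_letters(w)])
```

A naive character-level sort would interleave "sza..." words among the "s" words; ranking whole units instead places them after all plain "s" words, which is exactly what off-the-shelf sorting programs of the time could not do.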
With the definition and the initial implementation of this first text database system, a seed was sown for the coming use of language databases for storing and accessing linguistic data, something that has since become standard in Uppsala.

4 Conclusion

The early investments in computational linguistics in Uppsala bore fruit in 1987, when the government decided to establish a professorship in the subject. The government bill states: On the basis of earlier investments, Uppsala University is well positioned to offer a favorable research environment in this subject, [...] I recommend that a professorship in computational linguistics be established.6 The subject description from the docentship was retained and earlier research was followed up. New research areas included machine translation and computerized grammar checking.

The professorship legitimized the subject and important steps could be taken. Both doctoral education and undergraduate education were organized. A plan for the doctoral program was adopted in 1990, and undergraduate education began in 1995 with the Language Technology Program, which constituted the main recruitment base for the doctoral program. The program's students were also in demand on the labor market.

The first doctor in computational linguistics was Lars Borin. With his doctoral degree he represents one of Uppsala's most important contributions to the development of the subject, nationally and internationally. Lars and his fellow doctoral graduates have made, and continue to make, important contributions within academia and in society at large. It may be noted that the development of artificial intelligence can be traced back to machine learning and data-driven methods introduced, among other places, within computational linguistics. In his doctoral dissertation, Borin (1991) presents such an approach.

It has been a pleasure to follow Lars on the road from doctoral student in Slavic languages to doctor in computational linguistics. We look with gratitude on the contributions he made to the early development of the field.
The long and close cooperation between the language section and the Uppsala University Computer Center stands out as decisive for the positive development of computational linguistics in Uppsala.

References

Olli Blåberg. 1984. Svensk Böjningsmorfologi. En tvånivåbeskrivning. Unpublished Master's thesis, Department of General Linguistics, University of Helsinki.

Lars Borin. 1984. Ett textdatabassystem för lingvister (A text database system for linguists) [in Swedish]. In Anna Sågvall Hein, editor, Föredrag vid De Nordiska datalingvistikdagarna 1983, Uppsala den 3–4 oktober (Proceedings of the 4th Nordic Conference of Computational Linguistics, NODALIDA 1983), Rapport UCDL-R-841, pages 37–47. Uppsala universitet, Centrum för datorlingvistik.

Lars Borin. 1985. Tvånivåmorfologi. Introduktion och användarhandledning. Technical report, Rapport UCDL-L-3. Uppsala universitet, Centrum för datorlingvistik.

Lars Borin. 1991. The automatic induction of morphological regularities. Department of Linguistics, Uppsala University. Reports from Uppsala University, Linguistics (RUUL), 22.

Mats Carlsson. 1982. Uppsala Chart Parser, 2. System Documentation. UCDL-R-81-1. Uppsala University, Center for Computational Linguistics.

Martin Kay. 1974. Morphological and Syntactic Analysis. Lecture notes from the 3rd International Summer School of Computational and Mathematical Linguistics, Pisa, 1974.

Martin Kay. 1977. Reversible grammar. Handbook from the 1977 Nordic Summer School in Computational Linguistics. Palo Alto.

6 Regeringens proposition (Government bill) 1986/87:80

Geir Kjetsaa, Sven Gustavsson, Bengt Beckman, & Steinar Gil. 1984. The Authorship of the Quiet Don. Slavica Norvegica, vol. 1. Oslo and Atlantic Highlands, N.J.

Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-form Recognition and Production. Publications of the Department of General Linguistics, University of Helsinki, Finland.

NALP. 1978. Machine-readable text and dictionary material at UDAC.
Technical report, Report No 4 1978. Uppsala University Data Center, Uppsala. Natural Language Processing Group.

Valentina Rosén & Margareta Sjöberg. 1985. TEXTPACK: programpaket för språkvetenskaplig textbearbetning. Technical report, UCDL-L-85-2. Uppsala universitet, Centrum för datorlingvistik.

Anna-Lena Sågvall. 1968. Ett system för automatisk morfologisk analys av det ryska verbet, applicerat på en c:a 80 sidor lång rysk text. Licentiatavhandling. Uppsala universitet, Slaviska institutionen.

Anna-Lena Sågvall. 1973. A System for Automatic Inflectional Analysis. Implemented for Russian. Data linguistica 8. Stockholm: Almqvist & Wiksell.

Anna-Lena Sågvall, Berith Brännström, & Agneta Berghem. 1974a. Presentation av vid UDAC utvecklade verktyg för behandling av naturligt språk. Technical report, UDAC, Uppsala universitetsdatacentral. Uppsala, Sverige.

Anna-Lena Sågvall, Berith Brännström, & Agneta Berghem. 1974b. Text och Ord 1. Slaviska institutionen, Uppsala universitet, Sverige.

Anna-Lena Sågvall, Berith Brännström, & Agneta Berghem. 1975. Processing Natural Language at UDAC. Technical report, Report No 2, September 1975. Uppsala universitetsdatacentral. Uppsala, Sverige.

Anna-Lena Sågvall, Berith Brännström, & Agneta Berghem. 1976. MIR: A Computer Based Approach to the Acquisition of Russian Vocabulary in Context. System, 4(2):116–127.

Anna Sågvall Hein & Lars Ahrenberg. 1985. A Parser for Swedish. Status Report for SVE.UCP. Technical report, Rapport UCDL-R-85-2. Uppsala universitet, Centrum för datorlingvistik.

Anna-Lena Sågvall Hein. 1977. Chartanalys och morfologi. In Martin Gellerstam, editor, Nordiska Datalingvistikdagar 1977. Föredrag från en konferens i Göteborg.

Anna-Lena Sågvall Hein. 1979. Natural Language Processing Group (NALP) at Uppsala University Data Center (UDAC). In Nordic Linguistic Bulletin, Vol. 3, No. 1.

Anna Sågvall Hein. 1980. An overview of the Uppsala Chart Parser version 1 (UCP-1). Report no.
UCDL-R-80-1. Center for Computational Linguistics, Uppsala University, Department of Linguistics.

Anna Sågvall Hein. 1981. UCDL. Centrum för Datorlingvistik vid Uppsala universitet. En presentation. Technical report, Rapport UCDL-R-81-3. Centrum för datorlingvistik, Uppsala universitet, Sverige.

Anna Sågvall Hein. 1983. A Parser for Swedish. Status Report for SVE.UCP. Technical report, Rapport UCDL-R-83-2. Uppsala universitet, Centrum för datorlingvistik.

Anna Sågvall Hein. 1987. Parsing by means of Uppsala Chart Processor (UCP). In Natural Language Parsing Systems, pages 203–266. Springer Verlag, Berlin, Heidelberg.

William A. Woods. 1973. An experimental parsing system for Transition Network Grammars, pages 111–154. Algorithmics Press, New York.

From open parallel corpora to public translation tools: The success story of OPUS

Jörg Tiedemann
Department of Digital Humanities, Language Technology
University of Helsinki, Finland
jorg.tiedemann@helsinki.fi

Abstract

This paper describes the success story of OPUS, which started as a small side project and has led to a full-fledged ecosystem for training and deploying open machine translation systems. We briefly present the current state of the framework, focusing on the mission of increasing language coverage and translation quality in public translation models and tools that can easily be integrated in end-user applications and professional workflows. OPUS now provides the biggest hub of freely available parallel data, and thousands of open translation models have been released, supporting hundreds of languages in various combinations.

1 Introduction

The starting point of OPUS is clearly connected to Uppsala and the language technology research group at the former department of linguistics. Work with parallel corpora has been pushed by projects on machine translation (Tjong Kim Sang, 1999) and multilingual corpus-driven linguistics and lexicography (Borin, 1998; Borin, 2002).
The significant value of aligned multilingual data sets had been recognized by the leading researchers in the group, and various resources came out of their efforts, together with applications in translation studies, bilingual lexicon induction and machine translation development (Borin, 2000a; Borin, 2000b; Sågvall Hein et al., 2002). Inspired by those projects, OPUS filled the gap of public data sets that can be freely shared and used in research and development. Initially starting with software localization data, OPUS slowly grew into a massive collection of parallel translation data covering hundreds of languages and thousands of language pairs coming from a wide variety of domains.

The mission of OPUS was clear from the beginning: data sets in the collection shall be open and free and support reproducible science, to push cross-lingual NLP research and machine translation in particular. The essential principle is to provide a consistent interface to data sets that are readily prepared for further work without losing information from the original source. Wide language coverage has been a goal from the start, with a complete alignment across all languages included. The collection now represents a crucial foundation for wide-coverage machine translation.

Taking advantage of the huge resource, we launched OPUS-MT (Tiedemann & Thottingal, 2020), an initiative to systematically exploit the data set to train open neural machine translation (MT) models that can be shared and re-used as well. The project tackles the growing responsibility of language technology to provide essential tools for fair information access without language barriers, while avoiding commercial exploitation. Our focus is on transparency, and the paper describes our efforts in building the infrastructure that enables the use of free and independent machine translation in end-user applications and professional workflows.
Below, we briefly provide the background on OPUS and present tools for finding and processing the data. We then introduce OPUS-MT and its components before discussing the integration of pre-trained translation models in development platforms, end-user applications and translation workflows. Finally, we also present the importance of benchmarking and monitoring the progress, and briefly mention ongoing work on scaling up language coverage and optimizing translation models in terms of speed and applicability. Figure 1 illustrates the connections between various components.

Jörg Tiedemann. 2022. From open parallel corpora to public translation tools: The success story of OPUS. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 133–138. Available under CC BY 4.0

Figure 1: OPUS and OPUS-MT and their connections to other components, platforms and applications.

2 OPUS – The Open Parallel Corpus

OPUS1 has been a major hub for parallel corpora since 2004 (Tiedemann & Nygaard, 2004; Tiedemann, 2009; Tiedemann, 2012). The current release covers over 600 languages compiled into sentence-aligned bitexts for more than 40,000 language pairs. Over 20 billion sentences and sentence fragments correspond to 290 billion tokens, and the data set contains about 12 TB of compressed files. Despite the typical Zipfian distribution, there are over 300 language pairs with more than one million sentence pairs, a good base for high-quality machine translation.

OPUS tries to follow a consistent format, with a simple standalone XML format for language content and standoff annotation in the XCES Align format for links between translated sentences. The latter enables a space-efficient way of storing bilingually aligned multilingual data sets without duplicating essential content.
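The standoff idea can be illustrated with a small Python snippet that reads an alignment of the kind OPUS distributes. The element and attribute names follow the XCES Align convention as used by OPUS; the document names and sentence IDs below are invented:

```python
import xml.etree.ElementTree as ET

# Schematic OPUS-style XCES alignment: sentence IDs in two monolingual
# documents are linked without duplicating the text itself.
ALIGN_XML = """\
<cesAlign version="1.0">
  <linkGrp fromDoc="en/book1.xml.gz" toDoc="sv/book1.xml.gz">
    <link xtargets="s1;s1" />
    <link xtargets="s2;s2 s3" /><!-- one English sentence, two Swedish -->
  </linkGrp>
</cesAlign>
"""

def read_links(xml_text):
    """Return (source_ids, target_ids) pairs from an XCES alignment."""
    root = ET.fromstring(xml_text)
    pairs = []
    for grp in root.iter("linkGrp"):
        for link in grp.iter("link"):
            src, trg = link.attrib["xtargets"].split(";")
            pairs.append((src.split(), trg.split()))
    return pairs
```

Because only IDs are stored, the same monolingual documents can participate in alignments with many other languages, which is what makes the complete cross-lingual alignment in OPUS space-efficient.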
For convenience, other common data formats are generated from the native OPUS format, including plain-text versions with aligned sentences on corresponding lines and translation memory exchange files (TMX) that are common in professional translation platforms. Additionally, OPUS also releases token frequency counts, word alignment files and rough bilingual dictionaries extracted from automatically aligned bitexts.

Recently, we also released a compilation of the data under the label of the Tatoeba Translation Challenge2, TTC for short (Tiedemann, 2020). The purpose of this release is to provide a streamlined collection for MT training pipelines. The latest release of the TTC includes 29 billion translation units in 3,708 bitexts covering 557 languages altogether. We made an effort to unify different sources, to improve the consistency in language labeling and to remove noise and duplicates. Dedicated development and test sets are also provided to make the application of TTC as straightforward as possible in standard machine learning setups.

2.1 Finding and processing OPUS data with the OPUS-API and OpusTools

An important ingredient for OPUS is automation. Making resources available requires efficient ways of finding and accessing them. The OPUS-API3 provides an online API for searching resources and enables queries for specific languages and corpora. It provides the essential information about released data sets and returns download links to fetch data from the external data storage. The API responds in a simple JSON format, which can easily be used programmatically when looking for resources. We make use of the OPUS-API ourselves in the implementation of the OpusTools package4 (Aulamo et al., 2020a). This software library provides a Python interface with methods for locating, downloading and converting OPUS data sets.

1https://opus.nlpl.eu
2https://github.com/Helsinki-NLP/Tatoeba-Challenge
3https://opus.nlpl.eu/opusapi/
Command-line scripts such as opus_read provide convenient functions to query the database and to fetch data from the original storage. Furthermore, the tools read from compressed release packages and can be used to convert data sets into various formats such as TMX and plain text on the fly. Sentence alignments can also be filtered based on alignment confidence score, alignment type or language flag. For the latter, the package includes tools for automatic language identification.

2.2 Cleaning parallel data with OpusFilter

OpusFilter5 (Aulamo et al., 2020b) integrates the functionality provided by OpusTools and the OPUS-API but adds a modular system for filtering and preparing parallel data sets. It provides a wide variety of modules for data preparation and noise reduction. A YAML configuration file defines the pipeline that transforms raw corpus files into clean training and test set files, and the same pipeline can be generalized over multiple language pairs. The toolbox can easily be extended and currently supports different kinds of segment-level processing steps, such as tokenization and subword splitting, as well as filters based on automatic language identification, word alignment scores, language models and sentence embeddings. Furthermore, scores can be analyzed and visualized, and custom classifiers can be trained to make domain-specific filter decisions (Vázquez et al., 2019).

3 Machine translation with OPUS-MT

The natural next step after collecting and compiling parallel data is to systematically exploit them in learning machine translation models. OPUS-MT6 aims to provide training pipelines and solutions for deploying MT models derived from OPUS data. The goal is to develop a major hub for open state-of-the-art models with a large language coverage and straightforward use in end-user applications and further research and development.
The framework is based on Marian, an efficient implementation of neural machine translation (NMT) in pure C++ and with minimal dependencies (Junczys-Dowmunt et al., 2018). OPUS-MT training pipelines come in the form of makefile recipes that enable massive and systematic experiments on high-performance computing facilities. The automation provided by the recipes covers all necessary sub-tasks for preparing data sets, training models, testing their performance and finally releasing pre-trained NMT models. Special care has been taken to allow the creation of multilingual models that support more than one language as input or output. The recipes transparently handle different language combinations and combine data sets as necessary, adding language flags if required (Johnson et al., 2017). Subword segmentation using SentencePiece (Kudo & Richardson, 2018) is fully integrated, and automatic word alignment (Östling & Tiedemann, 2016) can be used to train transformer models with guided alignment features. Batch jobs can easily be created to run on SLURM-based task management systems. OPUS-MT further provides pipelines for data augmentation using back-translation (Sennrich et al., 2016) or pivot-based triangulation. Fine-tuning is also supported, in order to adapt to specific domains, user-specific data sets or selected language pairs in multilingual models.

3.1 Integrating OPUS-MT

Important for the success of pre-trained models is the ease of use and deployment. OPUS-MT strives to make the models accessible and useful for a wide range of users. Substantial efforts have been made to provide simple deployment procedures and integration routines for all our models.

4https://github.com/Helsinki-NLP/OpusTools
5https://github.com/Helsinki-NLP/OpusFilter
6https://github.com/Helsinki-NLP/OPUS-MT

Figure 2: Language coverage of translation models visualized on an interactive map.
Geolocations of languages are taken from Glottolog, and dot colors indicate the translation quality in terms of an automatic evaluation metric, measured in this case on the Tatoeba test set, on a scale from green (best) to red (worst). Smaller circles refer to smaller, less reliable test sets.

First of all, we provide methods to create translation servers using web applications that provide service APIs through web sockets and requests. Servers can easily be configured using JSON files, and the API also uses JSON for communication. Multiple translation servers can be combined and accessed via the same interface, and caching is implemented to decrease the workload of the server. All pre-trained models we release can be integrated in the server solutions we offer. Another important integration is the conversion of the native Marian NMT models to PyTorch, which opens up their use in a wide range of applications through the popular transformers library provided by Huggingface.7 Conversion scripts are available to prepare OPUS-MT models for the public model hub, making them available to the NLP community and also accessible through the online inference API. Similarly, we also integrate OPUS-MT models in the European Language Grid (ELG). Dockerized OPUS-MT servers run on the ELG platform, making it possible to directly access translation models from the ELG cloud services and the APIs provided by the infrastructure. The same docker images can also be downloaded from DockerHub and run locally or on other cloud infrastructures. The needs of professional translators are addressed by OPUS-CAT,8 a collection of tools and plug-ins that add the power of OPUS-MT to computer-assisted translation (CAT) workflows in Trados Studio, memoQ, and OmegaT. In some CAT tools, such as Wordfast, OPUS-CAT can be used by connecting directly to its API through a custom MT provider functionality.
OPUS-CAT also includes a Chrome browser extension, which makes it possible to use OPUS-MT in browser-based CAT tools. The Chrome extension currently supports Memsource9 and XTM.10 Unlike other solutions, OPUS-CAT runs MT locally and does not require sending data to any external service. This has huge advantages in terms of data privacy and security, and also enables fine-tuning of local translation engines on custom data without compromising data safety.

3.2 Benchmarks and evaluation

Monitoring language coverage and translation quality is important to keep track of the progress in our mission to improve language support and cross-lingual information accessibility using open MT solutions. Therefore, we systematically run benchmarks on all our models using a wide range of test sets coming from established evaluation campaigns. Automatic evaluation is certainly not sufficient, but it still provides a good indication of quality, especially if several benchmarks are used in parallel instead of relying on single test sets. We aim at a comprehensive collection of benchmarks,11 and results are stored in a public repository,12 which can be explored in a public leaderboard.13 Translation results are also kept in the same repository to make it possible to run further qualitative studies on the actual output of each model. Finally, we also create dynamic maps that visualize language coverage according to the geographic locations of languages supported by OPUS-MT (see Figure 2).

7https://huggingface.co/transformers/
8https://helsinki-nlp.github.io/OPUS-CAT/
9https://www.memsource.com/
10https://xtm.cloud/

4 Conclusions

Above, we have shown how OPUS developed from a small data collection initiative into a mature ecosystem for research on large-coverage machine translation. All the components connected to OPUS provide a complete framework for systematic experiments and state-of-the-art neural MT development.
The main building blocks are data collection, data curation, model training and system evaluation, as well as deployment and MT integration tasks. The data collection itself is extensive, and the coverage of the released MT models is already impressive. A lot of further work is ongoing, including the implementation of modular multilingual machine translation and the development of speed-optimized compact translation models using various kinds of knowledge distillation and quantization. Further integration into end-user applications on various devices is planned as well, and translation quality and language coverage are constantly being improved.

References

Mikko Aulamo, Umut Sulubacak, Sami Virpioja, & Jörg Tiedemann. 2020a. OpusTools and Parallel Corpus Diagnostics. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3782–3789, Marseille. European Language Resources Association.

Mikko Aulamo, Sami Virpioja, & Jörg Tiedemann. 2020b. OpusFilter: A Configurable Parallel Corpus Filtering Toolbox. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 150–156. Association for Computational Linguistics.

Lars Borin. 1998. ETAP: Etablering och annotering av parallellkorpus för igenkänning av översättningsekvivalenter (ETAP: Creating and annotating a parallel corpus for the recognition of translation equivalents). ASLA Information, 24(1):33–40.

Lars Borin. 2000a. ETAP project status report December 2000. Technical report, Uppsala University, Department of Linguistics.

Lars Borin. 2000b. You’ll Take the High Road and I’ll Take the Low Road: Using a Third Language to Improve Bilingual Word Alignment. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.

Lars Borin. 2002. Alignment and tagging. In Parallel corpora, parallel worlds.
Selected papers from a symposium on parallel and comparable corpora at Uppsala University, Sweden, 22-23 April, 1999, Language and computers: studies in practical linguistics, pages 207–218. Amsterdam: Rodopi.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, & Jeffrey Dean. 2017. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, & Alexandra Birch. 2018. Marian: Fast Neural Machine Translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne. Association for Computational Linguistics.

11https://github.com/Helsinki-NLP/OPUS-MT-testsets
12https://github.com/Helsinki-NLP/OPUS-MT-leaderboard/
13https://opus.nlpl.eu/leaderboard/

Taku Kudo & John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels. Association for Computational Linguistics.

Robert Östling & Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146.

Anna Sågvall Hein, Eva Forsbom, Jörg Tiedemann, Per Weijnitz, Ingrid Almqvist, Leif-Jöran Olsson, & Sten Thaning. 2002. Scaling Up an MT Prototype for Industrial Use - Databases and Data Flow. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC’2002), volume V, pages 1759–1766, Las Palmas de Gran Canaria.

Rico Sennrich, Barry Haddow, & Alexandra Birch. 2016.
Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin. Association for Computational Linguistics.

Jörg Tiedemann & Lars Nygaard. 2004. The OPUS Corpus - Parallel and Free: http://logos.uio.no/opus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon. European Language Resources Association (ELRA).

Jörg Tiedemann & Santhosh Thottingal. 2020. OPUS-MT - Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT). European Association for Machine Translation.

Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul. European Language Resources Association (ELRA).

Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online. Association for Computational Linguistics.

Erik Tjong Kim Sang. 1999. Aligning the Scania Corpus. Working Papers in Computational Linguistics & Language Engineering, 18.

Raúl Vázquez, Umut Sulubacak, & Jörg Tiedemann. 2019. The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 294–300, Florence. Association for Computational Linguistics.
Binomials in Swedish corpora – ‘Ordpar 1965’ revisited

Martin Volk and Johannes Graën
Department of Computational Linguistics, University of Zurich
volk|graen@cl.uzh.ch

Abstract

This paper describes a corpus study on Swedish binomials, a special type of multi-word expressions. Binomials are of the type X conjunction Y, where X and Y are words, typically of the same part-of-speech. Bendz (1965) investigated the various use cases and functions of such binomials and included a list of more than 1000 candidates in his appendix. We were curious to what extent these binomials can still be found in modern corpora. We therefore checked this list against the Swedish Europarl and OpenSubtitles corpora. We found that many of the binomials are still in use today, even in these diverse text genres. The relative frequency of binomials in Europarl is much higher than in OpenSubtitles.

Foreword

It is a great honor to contribute to this Festschrift for Lars Borin. Lars has been our source of inspiration for many years. He worked on parallel corpora (Borin, 2002) and word alignment (Borin, 2000) before we thought about it. He published on named entity recognition for the digital humanities (Borin et al., 2007; Borin et al., 2014), on the architecture and processing pipeline of Swedish Språkbanken (Borin et al., 2012; Borin et al., 2016), and on many other topics. Over the years we profited enormously from meetings and discussions with Lars. In addition to his being a knowledgeable discussion partner, we would like to thank him for being a great colleague and friend. We dedicate our little corpus study to him.

1 Introduction

Binomials are an interesting case of multi-word expressions. Binomials are patterns of the type X conjunction Y, where X and Y are words, typically of the same part-of-speech, connected by a conjunction (most often by and or its equivalent in the respective languages).
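The pattern can be made concrete with a small sketch of our own (purely illustrative, not the paper's actual extraction pipeline): given a toy list of (token, PoS) pairs, we collect every triple in which the middle token is a Swedish conjunction and the outer tokens share a part-of-speech.

```python
# Illustrative sketch: extract X-conjunction-Y binomial candidates where
# X and Y have the same part-of-speech. Toy data, not a real corpus query.
from collections import Counter

CONJUNCTIONS = {"och", "eller", "men"}  # Swedish 'and', 'or', 'but'

def binomial_candidates(tagged_tokens):
    """Yield (X, C, Y) triples where C is a conjunction and X, Y share a PoS."""
    for (x, x_pos), (c, _), (y, y_pos) in zip(
        tagged_tokens, tagged_tokens[1:], tagged_tokens[2:]
    ):
        if c in CONJUNCTIONS and x_pos == y_pos:
            yield (x, c, y)

# Toy PoS-tagged sequence: "vi ses förr eller senare , helt och hållet säkert"
tagged = [("vi", "PRON"), ("ses", "VERB"), ("förr", "ADV"), ("eller", "CCONJ"),
          ("senare", "ADV"), (",", "PUNCT"), ("helt", "ADV"), ("och", "CCONJ"),
          ("hållet", "ADV"), ("säkert", "ADV")]

counts = Counter(binomial_candidates(tagged))
print(counts)  # finds both 'förr eller senare' and 'helt och hållet'
```

In a real setting the triples would of course come from a tagged corpus, and the counts would feed into the frequency and idiomaticity measures discussed below.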
Some time ago, we investigated adverbial binomials (Volk et al., 2016; Volk & Graën, 2017; Graën & Volk, 2021) in a number of languages. Our study was initially motivated by the observation that some such binomials are homonyms with coordinated prepositions (e.g. DE: ab und zu, EN: on and on, SV: till och med) and require special treatment in automated language processing. We searched our corpora for special PoS patterns (e.g. with adjectives, adverbs, particles and prepositions) and measured the idiomaticity of candidates with the reversibility score, cf. Mollin (2014),1 with a mutual information score, and with an entropy measure in order to identify larger fixed expressions containing the binomial (e.g. FR: d’ores et déjà, DE: mehr als je zuvor, EN: both internally and externally, SV: om och om igen). For Swedish we found that till och med is by far the most frequent adverbial binomial, followed by först och främst and helt och hållet. Modern machine translation systems are surprisingly good at translating these fixed expressions, but parsers often have difficulties integrating them correctly into the syntactic structure.

1If X conj Y is much more frequent than Y conj X, then this is evidence for a fixed (i.e. idiomatic) expression.

Martin Volk and Johannes Graën. 2022. Binomials in Swedish corpora – ‘Ordpar 1965’ revisited. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 139–144. Available under CC BY 4.0

Figure 1: Excerpt from the Ordpar appendix in (Bendz, 1965), page 93. The examples display ordering variants, language information (da., ty., gr., lat., etc.), and optional parts (e.g. bryta), as well as references to the Bible and other publications.

A recent publication in Språktidningen (Landberg, 2022) reminded us that the topic of word pairings is still relevant for linguistics and also of interest for the social sciences.
The article highlighted the use of coordinated words in political discourse. We therefore returned to an open edge of our previous work. During our past investigation we had encountered (Bendz, 1965), an early work on Swedish binomials which includes an appendix with more than 1000 candidates. For the current paper, we turned this list into a computer-digestible format (and termed it “the Ordpar list” after the book’s name). We then checked the binomials of this list in two corpora of Swedish, since we are interested in finding out which of these binomials are used in modern Swedish.

1.1 The Ordpar list

Bendz (1965) introduces his list as follows: “Nedan följer en förteckning över svenska ordpar (samt några längre sammanställningar) av högst olika ålder, typ, stilvalör och frekvens. … Listan omfattar endast ordpar vilkas komponenter är samordnade med ‘och’, ‘eller’, ‘men’ eller asyndetiskt … Den är givetvis inte på något sätt fullständig, men, särskilt vad beträffar synonymier, tillräckligt rikhaltig för att vara representativ.” (Bendz, 1965, page 90). (English translation: “Below follows a list of Swedish word pairs (plus some longer combinations) of highly varying age, type, stylistic value and frequency. … The list only comprises word pairs whose components are coordinated with ‘och’, ‘eller’, ‘men’ or asyndetically … It is of course not in any way complete but, especially as regards synonymies, sufficiently rich to be representative.”)

After the first step of scanning, digitizing and structuring the list, we counted 1183 raw entries for Swedish.2 The vast majority (987) are X och Y cases, where X and Y consist of only one token. See Figure 1 for examples. Only 13 entries are of the type X eller Y (here: bära eller brista), and 5 entries are of the type X men Y. In addition, there are entries with alternatives of the three conjunctions: 12 entries with och+eller (e.g. tid och (eller) tillfälle), 4 entries with eller+och (e.g. mer eller (och) mindre), and 2 entries with men+och (e.g. sakta men (och) säkert). We assume that the order of the conjunctions indicates a preference. This leaves 160 entries in the Ordpar list with other characteristics. Many of them contain another token, either as a modifier (e.g. fläsk och bruna bönor), in a larger expression (e.g. mellan barken och trädet), or in a listing (e.g. guld, rökelse och myrra).
In a number of cases, the additional token is in parentheses and constitutes a variant (e.g. blommor (blomst) och blad(er)). We decided to unfold these cases into new entries. This means that we interpret this entry as standing for blommor och blad, blomst och blad, blommor och blader, and blomst och blader. In this way we processed 139 variant entries and unfolded them into 300 generated candidates.

It should be noted that the original Ordpar list contains 88 entries in both orders (e.g. it contains both kall och klar and klar och kall), which we have counted as two entries so far. For some of these entries, Bendz identifies the preferred order (e.g. “reda och ordning – oftast: ordning och reda”). In a few cases, the preference information comes with a temporal judgment (e.g. “jämt och samt – förr även: samt och jämt”). We mark all these double-order entries, since we will always check both orders for all binomials in order to compute the reversibility score.

To many entries, Bendz (1965) added information about corresponding occurrences of the binomial in other languages: Danish (200 entries), German (132), Latin (27), English (19), and single-digit numbers for Dutch, French, Italian, Spanish, Hebrew and Greek. Many binomials in the list also have references to the Bible (e.g. anda och sanning; vägen, sanningen och livet), to Svenska Akademiens Ordbok, and to other previous work. Interestingly, Bendz’ Ordpar list does not contain any reduplication binomials (e.g. mer och mer, igen och igen).

Our goal is to check the occurrence frequencies of all Ordpar entries in the Swedish parts of the Europarl (Koehn, 2005) and OpenSubtitles (Lison & Tiedemann, 2016) corpora, and to see by comparison whether binomials show a stable distribution pattern. The input to our investigations in the following sections are the 1101 binomial candidates of the type X conjunction Y that are explicitly or implicitly contained in the Ordpar list.

2Bendz (1965) also features shorter lists of binomials for German, English, French, Italian, Spanish, Latin, and Greek, which we ignore here. We make the digitized Swedish list available at https://pub.cl.uzh.ch/purl/ordpar_list

X C Y                          f(X,C)   f(C,Y)  f(X,C,Y)   local-MI  f(Y,C,X)  OS rank
till och med                    8 037   16 756     6 621  0,002 031         7        1
först och främst                3 916    3 938     3 857  0,001 442         –       13
helt och hållet                 3 188    2 532     2 532  0,000 970         –        5
i och med                       4 089   16 756     3 008  0,000 909        13       31
var och en                      2 078   13 899     1 768  0,000 558         –        7
saker och ting                  1 334    1 219     1 217  0,000 509         –        2
klart och tydligt               1 357    1 476     1 121  0,000 456       116       17
mer eller mindre                  812      910       799  0,000 346         1        8
grund och botten                  843      625       624  0,000 273         –       27
helt och fullt                  3 188      694       590  0,000 222         2       53
kvinnor och barn                2 818      813       589  0,000 220        49       11
sätt och vis                    1 764      396       386  0,000 156         –        6
förr eller senare                 230      257       228  0,000 110         –        3
är och förblir                    955      280       236  0,000 100         –       90
en och samma                      554      602       229  0,000 095         –       25
rättigheter och skyldigheter    5 293      335       258  0,000 090        20      426
tack och lov                      247      167       167  0,000 081         –        4
fullt och fast                    177      194       140  0,000 068         3      107
rätt och riktigt                  801      216       155  0,000 066         2      225
lugn och ro                       207      134       132  0,000 065         –        9

Table 1: List of Ordpar binomials in Europarl, ranked by the local mutual information score local-MI(X,C,Y). Please note that f(X,C,Y) is the frequency of the triple X conjunction Y, while f(Y,C,X) displays the frequency of the reverse order, if at all present in the corpus (a dash indicates that the reverse order was not attested). The rightmost column displays the rank in the OpenSubtitles corpus.

2 Binomials in the Swedish Europarl corpus

We cleaned the Europarl corpus for corpus linguistic studies (Graën et al., 2014) and processed both the Europarl and the OpenSubtitles corpus with Stanza (Qi et al., 2020), which leaves us with 35 and 240 million tokens, respectively. After cleaning and unfolding, the Ordpar list contains 1055 binomials of type X och Y (some of which are form variants as described above).
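The unfolding of parenthesized variants described above amounts to a Cartesian product over the alternatives of each slot. The sketch below is our own simplification (the slot representation is not the actual format of the digitized list):

```python
# Illustrative sketch of variant unfolding: each slot of an Ordpar entry is a
# list of alternatives, and the entry expands to every combination.
from itertools import product

def unfold(slots):
    """Expand a slot list such as [['blommor', 'blomst'], ['och'],
    ['blad', 'blader']] into all surface variants."""
    return [" ".join(choice) for choice in product(*slots)]

entry = [["blommor", "blomst"], ["och"], ["blad", "blader"]]
print(unfold(entry))
# → ['blommor och blad', 'blommor och blader', 'blomst och blad', 'blomst och blader']
```

Applied to all 139 variant entries, this kind of expansion yields the 300 generated candidates mentioned above.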
We searched all 1055 binomials in both orders (i.e. as X och Y and as Y och X) in both corpora.

X C Y                f(X,C)   f(C,Y)  f(X,C,Y)   local-MI  f(Y,C,X)  EP rank
till och med         19 649   27 858    16 017  0,000 851         –        1
saker och ting        5 816    4 214     4 187  0,000 266         –        6
förr eller senare     2 916    3 056     2 816  0,000 189         –       13
tack och lov          3 498    3 012     2 789  0,000 185         –       17
helt och hållet       2 480    2 135     2 119  0,000 145         –        3
sätt och vis          2 671    2 219     2 130  0,000 145         2       12
var och en            4 493   49 610     2 650  0,000 126        13        5
mer eller mindre      1 512    1 602     1 379  0,000 097         2        8
lugn och ro           2 357    1 423     1 328  0,000 091         1       20
in och ut            11 382    2 645     1 590  0,000 090       573      111
kvinnor och barn      2 304    3 413     1 291  0,000 081        23       11
fram och tillbaka     3 535    2 008     1 178  0,000 074         4       31
först och främst      2 150    1 081     1 056  0,000 073         –        2
liv och död           4 664    1 697       987  0,000 061         8       51
vänta och se          1 211   14 116       897  0,000 050         –       39
kött och blod         1 283    1 165       701  0,000 048        10      122
klart och tydligt       907      834       632  0,000 046        27        7
in eller ut             899      662       593  0,000 044       181      199
dag och natt          3 009      652       621  0,000 041       212       68
gott och ont          1 112      630       542  0,000 039        26       57

Table 2: List of Ordpar binomials in OpenSubtitles, ranked by the local mutual information score local-MI(X,C,Y). Here also f(X,C,Y) is the frequency of the triple X conjunction Y, and f(Y,C,X) is the frequency of the reverse order, if at all present in the corpus. The rightmost column displays the rank in the Europarl corpus as an indication of the varying prominence of the binomial in different text genres.

In order to measure the idiomaticity, we computed the local mutual information score (local-MI) for all candidates “X C Y”, where C is the conjunction.3 For this we used the bigram frequencies of “X C” and “C Y” in comparison with the frequency of the triple “X C Y”. Our formula is

local-MI(X,C,Y) = f(X,C,Y)/N × log2( (N × f(X,C,Y)) / (f(X,C) × f(C,Y)) )

with N being the number of tokens in the corpus.
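As a sanity check of the formula, the following small computation (our own, purely illustrative) approximately reproduces the local-MI values reported in Table 1, using N = 35 million tokens for Europarl:

```python
# Recompute local-MI for two Table 1 entries to verify the formula.
from math import log2

def local_mi(f_xc, f_cy, f_xcy, n):
    """local-MI(X,C,Y) = f(X,C,Y)/N * log2(N * f(X,C,Y) / (f(X,C) * f(C,Y)))."""
    return f_xcy / n * log2(n * f_xcy / (f_xc * f_cy))

# 'till och med' in Europarl: f(X,C)=8037, f(C,Y)=16756, f(X,C,Y)=6621
print(local_mi(8037, 16756, 6621, 35_000_000))  # close to the 0,002 031 in Table 1

# 'mer eller mindre': f(X,C)=812, f(C,Y)=910, f(X,C,Y)=799
print(local_mi(812, 910, 799, 35_000_000))      # close to the 0,000 346 in Table 1
```

The small residual differences come from N being given only as a round figure of 35 million.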
In this way, the local-MI score predicts the probability of “X C” being followed by Y, and the likelihood of “C Y” being preceded by X. A shorter way of writing this, with “Observed” versus “Expected” variables (the respective frequencies divided by N), is

local-MI(X,C,Y) = O × log2(O/E)

We first search with lower-cased tokens from the corpus, which results in 29 345 binomial occurrences of 431 different types in Europarl and 69 849 occurrences of 834 types in OpenSubtitles. Table 1 shows the Ordpar binomials with the highest local mutual information scores in the Europarl corpus. Out of the 431 different binomials, 86 occur in both orders, 49 of them with a combined frequency of 10 or more. We observe that the top candidates have clear ordering preferences, with very few occurrences in the reverse order. On the opposite end, with more balanced variants (and thus allegedly more compositional meaning), are nu och då vs. då och nu (9 vs. 7), sedlar och mynt (78 vs. 48), and höger och vänster (38 vs. 24).

3See Evert (2008, page 1226) for a motivation for this measure.

Since our Europarl corpus is lemmatized, we may alternatively search the Ordpar binomials via the lemmas in the corpus. In theory this would conflate e.g. hel och full (173 hits) and helt och fullt (425 hits) into the same binomial and combine their frequency counts. But since the Ordpar binomials are not lemmatized systematically, this does not work. Therefore we search only over the word forms in the corpus.

PoS     Europarl               OpenSubtitles
        frequency    ratio     frequency    ratio
ADJ        92 979    21,2%        96 238    11,6%
ADP        14 180     3,2%        30 465     3,7%
ADV        17 414     4,0%        74 891     9,0%
NOUN      277 744    63,3%       334 606    40,2%
PRON        7 479     1,7%       142 737    17,2%
VERB       29 242     6,7%       153 321    18,4%
Total     439 038              832 258

Table 3: List of absolute and relative frequencies of different PoS in binomial configurations.

The PoS tags for the found binomials are not perfectly reliable, since the binomials often constitute a special syntactic environment.
But the top frequencies of the PoS pairs for the Ordpar binomials might still give us an indication of the most typical ones. The most frequent PoS pairs are adverbs (14 456 hits), nouns (5156), adjectives (2698), pronouns (1860), and verbs (449).

For the other conjunctions (eller, men) we run the same counting experiments. The unfolded Ordpar list has 32 binomials with eller, which we investigate in either order (via the 59 trigger words involved). We find a total of 18 binomials in Europarl (when searching via tokens, i.e. word forms), but only four of them add up to more than 10 hits: mer eller mindre (799 hits), förr eller senare (228 hits), liv eller död (21 hits), and nu eller aldrig (12 hits). The unfolded Ordpar list contains 8 different binomials with the Swedish conjunction men, 3 of which occur in Europarl: sakta men säkert (101 hits), långsamt men säkert (13 hits) and hårt men rättvist (1 hit).

Overall this means that Ordpar binomials occur a total of 29 345 times in Europarl (28 111 for och, 1095 for eller, 115 for men and 24 for other conjunctions, namely om, som and över), when counted via word forms. This corresponds to a relative frequency of 838.4 per 1 million tokens in the Europarl corpus.

3 Binomials in Swedish OpenSubtitles

For comparison, we investigate the Swedish OpenSubtitles corpus (240 million tokens) with the same methods as above. Table 2 shows the top-ranked binomials from the Ordpar list. Many binomials differ in rank because of the different text genre. For example, liv och död, which is at rank 14 in OpenSubtitles, is only at rank 51 in Europarl. It is also noteworthy that in och ut has a high local-MI score even though it also has a frequent reversibility option (1590 vs. 573). Based on the Ordpar list, we find 69 849 binomials in OpenSubtitles. This results in a relative frequency of 290.6 Ordpar binomials per 1 million tokens in OpenSubtitles.
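The order preference behind such reversibility observations can be quantified very simply. The ratio below is our own illustrative simplification (not Mollin's (2014) full reversibility score), applied to the in och ut counts from Table 2:

```python
# Illustrative order-preference ratio: share of occurrences realized in the
# canonical order X C Y among all occurrences of either order.
def order_preference(f_xcy, f_ycx):
    return f_xcy / (f_xcy + f_ycx)

# 'in och ut' in OpenSubtitles (Table 2): 1590 'in och ut' vs. 573 'ut och in'
ratio = order_preference(1590, 573)
print(f"{ratio:.2f}")  # about 0.74: a clear preference, but far from fixed
```

Fully fixed binomials such as till och med approach a ratio of 1.0, whereas values near 0.5 indicate freely reversible (and thus presumably more compositional) pairs.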
We also calculated the frequencies of binomial candidates in both corpora, searching for the pattern X conjunction Y with X and Y being of the same PoS (shown in Table 3). This gives us an indication of the typical types of binomials in each genre. In comparison, we clearly see that binomials are much more popular in Europarl, and that adjective and noun binomials are used considerably more often in the Europarl debates (despite OpenSubtitles being about 8 times the size).

4 Conclusion

The Ordpar list, as compiled by Bendz (1965) more than 50 years ago, still serves as a source of inspiration for the investigation of binomials today. It is fascinating to observe that many of the idiomatic binomials in the Ordpar list are still in use in modern texts as diverse as parliamentary proceedings and movie subtitles. As suspected, we find more binomials in parliamentary proceedings (Europarl) than in subtitles. We suspect that this is due to the constraint that subtitles must be short and concise, whereas binomials are often repetitive (e.g. först och främst instead of först) and thus add to the length.

References

Gerhard Bendz. 1965. Ordpar. P. A. Nordstedt & Söners Förlag, Stockholm.

Lars Borin, Dimitrios Kokkinakis, & Leif-Jöran Olsson. 2007. Naming the past: Named entity and animacy recognition in 19th century Swedish literature. In Proceedings of The ACL Workshop on Language Technology for Cultural Heritage Data (LaTeCH), Prague.

Lars Borin, Markus Forsberg, & Johan Roxendal. 2012. Korp — the corpus infrastructure of Språkbanken. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pages 474–478, Istanbul. European Language Resources Association (ELRA).

Lars Borin, Dana Dannélls, & Leif-Jöran Olsson. 2014. Geographic visualization of place names in Swedish literary texts. Literary and Linguistic Computing, 29(3).

Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, & Anne Schumacher. 2016.
Sparv: Språkbanken’s corpus annotation pipeline infrastructure. In The Sixth Swedish Language Technology Conference (SLTC), pages 17–18, Umeå.

Lars Borin. 2000. You’ll take the high road and I’ll take the low road: Using a third language to improve bilingual word alignment. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 97–103, Saarbrücken.

Lars Borin, editor. 2002. Parallel Corpora, Parallel Worlds. Selected Papers from a Symposium on Parallel and Comparable Corpora at Uppsala University, Sweden, 22-23 April, 1999, volume 43 of Language and Computers. Rodopi, Amsterdam.

Stefan Evert. 2008. Corpora and collocations. In A. Lüdeling & M. Kytö, editors, Corpus Linguistics. An International Handbook, volume 2, pages 1212–1248. Walter de Gruyter.

Johannes Graën & Martin Volk. 2021. Binomial adverbs in Germanic and Romance languages. A corpus-based study. In Julia Lavid-López, Carmen Maíz-Arévalo, & Juan Rafael Zamorano-Mansilla, editors, Corpora in Translation and Contrastive Research in the Digital Age. Recent advances and exploration, chapter 13, pages 326–342. John Benjamins.

Johannes Graën, Dolores Batinic, & Martin Volk. 2014. Cleaning the Europarl corpus for linguistic applications. In Proceedings of the Conference on Natural Language Processing (KONVENS), pages 222–227. Stiftung Universität Hildesheim.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation.

Joel Landberg. 2022. Ett “och” betyder så mycket (An “och” means so much). Språktidningen, pages 24–30.

Pierre Lison & Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC).

Sandra Mollin. 2014. The (Ir)reversibility of English Binomials.
Corpus, Constraints, Developments, volume 64 of Studies in Corpus Linguistics. John Benjamins.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, & Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Martin Volk & Johannes Graën. 2017. Multi-word adverbs – how well are they handled in parsing and machine translation? In Proceedings of The 3rd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT), London.

Martin Volk, Simon Clematide, Johannes Graën, & Phillip Ströbel. 2016. Bi-particle adverbs, PoS-tagging and the recognition of German separable prefix verbs. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS), pages 297–305.

ICALL: Research versus reality check

Elena Volodina
Språkbanken Text, University of Gothenburg, Sweden
elena.volodina@svenska.gu.se

David Alfter
CENTAL, Université catholique de Louvain
david.alfter@uclouvain.be

Abstract

Intelligent Computer-Assisted Language Learning has been one of Lars Borin’s research interests. The work on the Lärka language learning platform started under his coordination. We see it as our mission to make the platform live and prosperous, and through it to stimulate research into Swedish as a second language. Below, we name some weaknesses we have identified in Lärka while working with a course of beginner Swedish, and outline our plans for tackling them.

1 Introduction

A research agenda is – by definition – driven by research needs. However, the aim of research is to explain the world around us, as well as to explore and outline new ways of approaching certain phenomena (Hempel, 1967). It should thus aim to align with real-world needs, at least to a certain degree. This is especially true of research aimed at language learning.
Språkbanken Text1 has for at least a decade been involved in research around Intelligent Computer-Assisted Language Learning (ICALL), with a major focus on Swedish as a second language (L2 Swedish); see, for example, Volodina et al. (2016), Pilán (2018) and Alfter (2021). As part of these research activities, a platform for language learning, Lärka,2 has been implemented (Volodina et al., 2014; Volodina & Pijetlovic, 2015; Alfter et al., 2019a). Lärka is used for staging experiments, testing prototypes and demo purposes, as well as for the collection of research data and learner logs for further analysis.

Probably the most significant impact of Lärka on research could be expected from research data collection, which could feed into new research insights relevant for the field of L2 Swedish and Second Language Acquisition (SLA). However, the major problem up to now has been to attract language learners to use Lärka. While the Lärka exercise types demonstrate the capacity to use language technology for language learning purposes, the basic pedagogical aspects, unfortunately, are not considered, and that is a hypothetical reason why neither teachers nor learners see enough benefit in using the platform to help generate new research data. The basic question is therefore – how can this be changed?

The recent effort by the authors to set up a self-study course in basic Swedish for Ukrainian refugees, SwedishFromScratch3 (SFS), has become a much-needed reality check. SFS, a course which we make available through a dedicated channel on the social media platform Telegram4 – while having the initial aim of helping one of the authors’ relatives fleeing from the war in Ukraine – has collected over 1800 followers in a couple of months. We receive messages from the users with thanks, reported problems and requests for desired features and materials (all of which testifies to active usage).
This degree of acceptance took us by surprise and put our previous work on Lärka into a new perspective, which we analyze and present in this paper.

1https://spraakbanken.gu.se/
2https://spraakbanken.gu.se/larkalabb/
3https://spraakbanken.gu.se/en/research/themes/icall/swedish-from-scratch
4https://t.me/quizlet4swedish

Elena Volodina and David Alfter. 2022. ICALL: Research versus reality check. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 145–151. Available under CC BY 4.0

Below, we describe Lärka and the course SwedishFromScratch (Sections 2 and 3), present our insights from a comparison of the two (Section 4), and conclude with our plans for moving forward (Section 5).

2 Lärka in a nutshell

Lärka is a research platform that offers, among other things, automatically generated exercises based on authentic corpus material (Volodina et al., 2012; Pilán et al., 2014; Lindström Tiedemann et al., 2016), a text evaluation tool (Pilán et al., 2016), a lexicographic annotation tool (Alfter et al., 2019b), and the Swedish L2 profile (Volodina et al., 2021). For exercise generation, there are highly customizable linguistic knowledge exercises for parts-of-speech, syntactic relations, and semantic roles (Volodina et al., 2014; Pilán & Volodina, 2014). For language learners, there are multiple-choice exercises (vocabulary and inflection), a listening exercise (Volodina & Pijetlovic, 2015), and two gamified exercises: Wordguess, a hangman-type game based on dictionary translations, and a particle verb exercise based on parallel multilingual corpora (Alfter & Graën, 2019).

3 SwedishFromScratch: an outline

SwedishFromScratch is a free course that can be followed by anyone via a Telegram channel, where notifications about new lessons are sent to subscribers. Up to now, the course has been a combination of several types of lessons:

1.
Quizlet lessons5 primarily introduce Swedish vocabulary and phrases with their pronunciation and translation into Russian. We select the vocabulary for training ourselves and create Quizlet flashcards manually. A set of exercises is automatically generated by Quizlet based on the flashcards, e.g. listening, spelling, translation, matching, test. We also introduce and practice some grammar phenomena using Quizlet functionalities. The collection of lessons is continuously growing.

2. Clilstore lessons6 present texts for practicing reading, which we write ourselves (with a few exceptions) to avoid copyright issues. On the starting page, one can select language sv and order by Title. The numbered texts (currently at A1-B1 levels) belong to this course. Once inside a text, the Clilstore platform (Gimeno & Dónaill, 2008) allows the user to click on words, which opens a dictionary window to the right. The user can set a language for translation (Russian in this case) and a dictionary, e.g. Lexin (Hult et al., 2010) at the beginner levels. Currently, there are 24 Clilstore texts connected to the SFS course, and the collection is constantly growing.

3. Grammar explanations are reused from several sources, among others the SFI grammar7 that we translate into Russian,8 and an online encyclopedia of Swedish in Russian (Maslova-Lashanskaya, 1953).9

4. Grammar exercises10 are based on the course texts we publish in Clilstore. The texts are uploaded to the generator, automatically analyzed for parts-of-speech, and gapped items are generated for a selected word class. To avoid errors, the automatic annotation is manually checked, which is currently a bottleneck in the process. Only the first six texts out of the 24 available are prepared for the grammar exercise training module.

5. Third-party open materials are used to complement the course, e.g. UR språkplay,11 a set of films and series with subtitles for each phrase you hear and a translation into the language of your choice.
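The gap-item generation step described in point 4 above can be sketched as follows. This is a toy illustration, not the actual Lärka/SFS pipeline: the Token class and tagset are our assumptions, and in practice the POS tags would come from an automatic tagger and be manually corrected first, as noted above.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    pos: str  # e.g. "NOUN", "VERB" (assumed tagset)

def gap_items(tokens, target_pos):
    """Return (gapped sentence, answer) pairs, one per target-POS token."""
    items = []
    for i, tok in enumerate(tokens):
        if tok.pos == target_pos:
            gapped = " ".join(
                "____" if j == i else t.text for j, t in enumerate(tokens)
            )
            items.append((gapped, tok.text))
    return items

# Toy sentence with hand-assigned tags (standing in for corrected tagger output):
sentence = [Token("Jag", "PRON"), Token("läser", "VERB"),
            Token("en", "DET"), Token("bok", "NOUN")]
print(gap_items(sentence, "NOUN"))
# [('Jag läser en ____', 'bok')]
```

One item is produced per occurrence of the selected word class, which is why the manual tag check matters: a mistagged token silently becomes a wrong exercise item.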
5https://quizlet.com/class/22036236/
6https://clilstore.eu/clilstore/
7https://sfipatxi.wordpress.com/grammatik/
8https://elenavolodina.github.io/SwedishFromScratch/Grammar
9https://svspb.net/bok/
10A prototype of an exercise: https://spraakbanken.gu.se/larkalabb/sfs/?lesson_number=3&pos=NOUN&tl=english
11https://www.ur.se/sprakplay/#/programs

Figure 1: Pros and cons of the tools used.

Lessons are sequenced in a suggested learning order. Learners may follow a different path if they wish: repeat lessons, skip lessons or take a break from the course. Since this is a self-study course, learners can also adapt the course to their time restrictions. In the process of preparing lessons, we function as teachers, search for tools that can address the course needs, and continuously evaluate Lärka from that point of view.

4 Lessons learned

Previous research, e.g. Arhar Holdt et al. (2020), Burstein (2009), suggests that teachers are likely to adopt (technical) innovations only if these fit logically into their daily routines. In this connection, Burstein et al. (2012) identify five critical components for technical innovations aimed at language classrooms, three of which we mention here:

• Relevance to the curriculum standards and lesson objectives, i.e. suggested technological solutions should be able to provide support with immediately relevant tasks.
• Ability to keep learners motivated and focused on the learning goals, i.e. the innovations should not distract from the learning goals, but help concentrate on them.
• Potential for independent practice, i.e. technical solutions should facilitate independent learner work, e.g. the creation of activities that can be given to learners for self-study or homework.

Figure 2: Authoring tool prototype. (a) Manual correction; (b) Exercise options.

All of the above criteria are met by SwedishFromScratch, while only the last one can be claimed to be met by Lärka.
Most critically, the exercises in Lärka are generated from random words of a given proficiency level. This does not encourage learning, since the vocabulary scope is too wide. At best, random selection of vocabulary items facilitates testing or provides a gaming experience. For learning to actually occur, the number of words per session needs to be limited. Therefore, one important lesson from our SFS initiative is that we should offer teachers or learners control over what they use for a lesson, e.g. the possibility to enter the words for training themselves, limiting the scope to the relevant items.

Another issue concerns the context of Lärka exercises. Sentences, which are currently the unit used for exercise generation, are free of copyright issues. However, they cannot frame the focus of a whole lesson. Texts are a much better alternative in this respect. Even texts available on the Internet may still have copyright problems if used for purposes other than the ones intended by the authors. Thus, the option to upload their own texts, or texts that they are sure can be used, should at least be offered to the users (cf. the Clilstore concept). This way, we could collect free reading materials through Lärka for future use and generate lesson materials based on those texts, covering, for example, vocabulary training, morphology and grammar exercises, as well as syntax-focused exercises.

Figure 1 summarizes further pros and cons of the various tools discussed here. It suggests that each of these tools offers a unique type of language learning material generation and sharing, but is insufficient on its own for setting up a comprehensive language learning course.
5 Research and practice: agenda

Returning to the question we asked at the beginning of this article, namely how we can attract more users to Lärka and stimulate the growth of learner-produced data for research, we can summarize the following: In the best of worlds, Lärka’s NLP-based algorithms should be coupled with the functionalities offered by Clilstore (texts with translation) and Quizlet (flashcards and exercise generation) or similar, to achieve maximal flexibility and usefulness for teachers and learners, and as a consequence to stimulate users to produce research data. Importantly, this would form a sound pedagogical basis for creating self-study courses, collecting all the currently distributed modules (in SFS or any other potential course) into one space, and offering:

– for teachers, an easier way to author a language lesson or a course;
– for learners, a convenient way to follow the course;
– for researchers, a safe and stable way to collect user-generated data (with access to learners).

Language Muse (Burstein & Sabatini, 2016; Burstein et al., 2017) has used this type of approach, and it has been shown to work for teachers, learners and researchers alike. One planned practical step to achieve the above in Lärka is to provide an authoring tool that:

1. Incorporates manual correction of automatic processes such as POS annotation (Figure 2).
2. Generates gapped cloze items and potentially other item types out of texts.12
3. Follows standard conventions, such as leaving the first and last sentence in a text intact, replacing only every x-th word, etc.
4. Gives more control to the course preparer by generating exercises semi-automatically based on choices made by the teacher.
5. Can be extended to allow for more diverse types of exercises.

The current prototype (Figure 2) offers some of the above-mentioned control to the course author. To summarize, we have two major conclusions to draw from this experience.
The first is a general one: researchers need periodic reality checks to understand how to adjust their research agenda to reality, in order to prevent their research from turning into a selfish playground. The second conclusion is relevant to the ICALL field: by giving (a certain amount of) control to teachers and learners, ICALL platforms (e.g. Lärka) may become a win-win for both pedagogical and research scenarios, an insight that we would not have gained without the reality check of the SFS course.

Acknowledgements

This work has been supported by Nationella språkbanken and HUMINFRA, both funded by the Swedish Research Council (2018-2024, contract 2017-00626; 2022-2024, contract 2021-00176) and their participating partner institutions.

References

David Alfter & Johannes Graën. 2019. Interconnecting lexical resources and word alignment: How do learners get on with particle verbs? In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 321–326.

12https://spraakbanken.gu.se/larkalabb/sfs/?lesson_number=4&pos=PRON&tl=english

David Alfter, Lars Borin, Ildikó Pilán, Therese Lindström Tiedemann, & Elena Volodina. 2019a. Lärka: from language learning platform to infrastructure for research on language learning. In Selected papers from the CLARIN Annual Conference 2018, Pisa, 8-10 October 2018, number 159, pages 1–14. Linköping University Electronic Press.

David Alfter, Therese Lindström Tiedemann, & Elena Volodina. 2019b. LEGATO: A flexible lexicographic annotation tool. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 382–388.

David Alfter. 2021. Exploring natural language processing for single-word and multi-word lexical complexity from a second language learner perspective. Doctoral Thesis. University of Gothenburg, Data Linguistica 31.
Špela Arhar Holdt, Rina Zviel-Girshin, Elżbieta Gajek, Isabel Durán-Muñoz, Petra Bago, Karën Fort, Ciler Hatipoglu, Ramunė Kasperavičienė, Svetla Koeva, Ivana Lazić Konjik, et al. 2020. Language teachers and crowdsourcing: Insights from a cross-European survey. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 46(1):1–28.

Jill Burstein & John Sabatini. 2016. The Language Muse Activity Palette. Adaptive educational technologies for literacy instruction, pages 275–280.

Jill Burstein, Jane Shore, John Sabatini, Brad Moulder, Steven Holtzman, & Ted Pedersen. 2012. The Language Muse system: Linguistically focused instructional authoring. ETS Research Report Series, 2012(2):i–36.

Jill Burstein, Nitin Madnani, John Sabatini, Dan McCaffrey, Kietha Biggers, & Kelsey Dreier. 2017. Generating Language Activities in Real-Time for English Learners using Language Muse. In Proceedings of the Fourth (2017) ACM Conference on Learning@Scale, pages 213–215. ACM.

Jill Burstein. 2009. Opportunities for natural language processing research in education. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 6–27. Springer.

A. Gimeno, C. Ó Dónaill, & R. Zygmantaite. 2013. Clilstore Guidebook for Teachers. Tools for CLIL Teachers.

Carl G Hempel. 1967. Philosophy of natural science. British Journal for the Philosophy of Science, 18(1).

Ann-Kristin Hult, Sven-Göran Malmgren, & Emma Sköldberg. 2010. Lexin: a report from a recycling lexicographic project in the North. In Proceedings of the XIV Euralex International Congress (Leeuwarden, 6-10 July 2010).

Therese Lindström Tiedemann, Elena Volodina, & Håkan Jansson. 2016. Lärka: ett verktyg för träning av språkterminologi och grammatik. LexicoNordica, 23:161–181.

S.S. Maslova-Lashanskaya. 1953. Shvedsky yazyk: Chast pervaya. Otvetstvennyj redaktor prof. M.E. Steblin-Kamensky. L.: Izd-vo LGU Zhdanova.

Ildikó Pilán & Elena Volodina. 2014.
Reusing Swedish FrameNet for training semantic roles. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1359–1363.

Ildikó Pilán, Elena Volodina, & Richard Johansson. 2014. Rule-based and machine learning approaches for second language sentence-level readability. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 174–184.

Ildikó Pilán, David Alfter, & Elena Volodina. 2016. Coursebook texts as a helping hand for classifying linguistic complexity in language learners’ writings. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 120–126.

Ildikó Pilán. 2018. Automatic proficiency level prediction for Intelligent Computer-Assisted Language Learning. Doctoral Thesis. University of Gothenburg, Data Linguistica 29.

Elena Volodina & Dijana Pijetlovic. 2015. Lark Trills for Language Drills: Text-to-speech technology for language learners. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 107–117.

Elena Volodina, Lars Borin, Hrafn Loftsson, Birna Arnbjörnsdóttir, & Guðmundur Örn Leifsson. 2012. Waste not, want not: Towards a system architecture for ICALL based on NLP component re-use. In Proceedings of the SLTC 2012 workshop on NLP for CALL, pages 47–58.

Elena Volodina, Ildikó Pilán, Lars Borin, & Therese Lindström Tiedemann. 2014. A flexible language learning platform based on language resources and web services. In Proceedings of LREC 2014, Reykjavik, Iceland, pages 3973–3978.

Elena Volodina, Ildikó Pilán, & David Alfter. 2016. Classification of Swedish learner essays by CEFR levels. CALL communities and culture – short papers from EUROCALL, 2016:456–461.

Elena Volodina, Yousuf Ali Mohammed, & Therese Lindström Tiedemann. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish.
In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 178–189.

Lyxig språklig födelsedagspresent from the Swedish Word Family

Elena Volodina & Yousuf Ali Mohammed, Språkbanken Text, University of Gothenburg, Sweden, name1.name2.surname@svenska.gu.se
Therese Lindström Tiedemann, University of Helsinki, Finland, therese.lindstromtiedemann@helsinki.fi

Abstract

Morphology and lexical resources are known to be two of Lars Borin’s biggest research passions. We have, therefore, prepared a short description of a new kind of lexical resource for Swedish, the Swedish Word Family. The resource is compiled based on learner corpora, and contains lexical items manually analyzed for derivational morphology.

1 Introduction

In the past couple of years, we have compiled a Swedish Word Family (SweWF) resource to study word-formation mechanisms in Swedish learner language (a publication is in preparation). SweWF can boast unique features that differentiate it from the majority of other word (and morpheme) family resources, e.g. Körtvélyessy et al. (2020), Nikolaev et al. (2019), Hiebert et al. (2018), Zeller et al. (2013), Baayen et al. (1996), Bauer & Nation (1993):

1. we include compounds among the family members
2. we include multi-word expressions in the families
3. all lexical items in the SweWF resource contain other types of associated information, so that it is possible to use lemmas, lemgrams and sense-based units for comparisons with, for example, reference L1 corpora, depending on the type of units used in the reference corpora
4. the resource is descriptive in nature, and as such it is possible to use it for both teaching and research.

The Swedish Word Family, based on the first version of CoDeRooMor (Volodina et al., 2021), contains 16 230 sense-based lemgrams organized into 4 400 word families (i.e. grouped through shared roots). The size of each particular family varies between 1 and 281 members.
Most numerous are the families where the root, if taken individually, is a preposition/particle, e.g. ut (‘out, towards’), including such family members as utbildning (‘education’), utsläpp (‘exhaust, release’), söderut (‘southward’). The distribution of word families by their size in L2 Swedish corpora is shown in Figure 1 and Table 1. Families with few members (1-9) constitute the majority of all families (87%), with only 13% of families containing 10 or more members. Around 42% of word families (1868) in the Swedish word family resource are singletons containing only one family member (e.g. one word like asfalt, astma, alzheimers). Less than one percent of the families contain 61-281 members, and these groups consist of roots that are nearly exclusively either Scandinavian or common Germanic words. The Swedish word family resource shows that word families with fewer members more often tend to contain loanwords, in contrast to word families with a larger number of members.

Elena Volodina, Yousuf Ali Mohammed and Therese Lindström Tiedemann. 2022. Lyxig språklig födelsedagspresent from the Swedish Word Family. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 153–160. Available under CC BY 4.0

Figure 1: Distribution of word families by number of members.

Nr of word families | Family size | Percent of all word families | Examples
1956 | 1 | 44.45 | asfalt, astma, alzheimers
1886 | 2-8 | 42.86 | lyx: lyxa, lyxig, lyxliv, ...
370 | 9-20 | 8.41 | fam: familj, familjfoto, ...
159 | 21-60 | 3.61 | nord: nordisk, Nordamerika, ...
15 | 61-100 | 0.34 | språk: språklig, språkljud, ...
14 | 101-281 | 0.32 | dag: måndag, daglig, ...

Table 1: Statistics over family sizes.

2 Hypotheses and case studies

The Swedish Word Family resource presents an opportunity to trace various trends in the language, including linguistic, cultural and cognitive ones.
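The family-size statistics of the kind shown in Table 1 rest on a simple grouping of lexical items by their shared root. A minimal sketch, assuming a simplified item format with a manually annotated root (toy data, not the actual SweWF encoding):

```python
from collections import defaultdict

# (lemgram, root) pairs -- toy data, not taken from the resource
items = [
    ("dag..nn.1", "dag"), ("måndag..nn.1", "dag"), ("daglig..av.1", "dag"),
    ("lyx..nn.1", "lyx"), ("lyxig..av.1", "lyx"),
    ("asfalt..nn.1", "asfalt"),
]

# Group items into families by their shared root
families = defaultdict(list)
for lemgram, root in items:
    families[root].append(lemgram)

# Family sizes and the number of singleton families
sizes = {root: len(members) for root, members in families.items()}
singletons = sum(1 for n in sizes.values() if n == 1)
print(sizes)        # {'dag': 3, 'lyx': 2, 'asfalt': 1}
print(singletons)   # 1
```

Binning the `sizes` values into ranges such as 1, 2-8, 9-20, etc. then yields a size distribution of the kind reported in Table 1.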
In this paper we look into the following hypotheses: (a) Simpler words (consisting of a minimal number of morphemes) within each family appear at earlier levels and are more frequent. (b) Related to the above: relations between word family members are ordered by complexity through word-formation mechanisms (inspired by Lango et al. (2021)), which is reflected in the order of appearance of new word family items in receptive and productive data.

2.1 Case Study 1: distribution of simplex root lexemes

To look into the first hypothesis, we start from an analysis of the distribution of distinct simplex root lexemes over the two corpora. By simplex root lexeme we understand lexical items that consist of one morpheme only, namely strictly of one root, e.g. dag (‘day’). By distinct we mean that we account for each simplex root lexeme only once, at the level where it occurs for the first time. For example, dag may have been used for the first time at A1 level, but is repeatedly used at all other levels. We count dag only once, at the level of its first occurrence (i.e. A1). We thus do not count dag among root lexemes at levels above A1.

| Total | A1 | A2 | B1 | B2 | C1 | C2
SweLL pilot (productive) | 1108 (23%) | 254 (52%) | 347 (39%) | 207 (22%) | 189 (15%) | 108 (10%) | 3 (5%)
examples | | snö, se, mjölk | lär, värd | yr, blind | falsk, botten | armé, mord, ström |
Coctaill (receptive) | 2195 (16%) | 732 (36%) | 499 (20%) | 447 (12%) | 361 (10%) | 157 (7%) | 0
examples | | fisk, gift | paj, skuld | café, dräng | hinder, slarv, tvist | feg |

Table 2: Distribution of root lexemes across the two corpora per level: absolute number of simplex root items (and their percentage of all new items at each level).

There are a total of 2298 distinct simplex root lexemes in the Swedish Word Family resource. These are split between receptive (2195 items) and productive (1108 items), with an overlap of 1063 items, as shown in Table 2. These simplex root lexemes are more strongly represented at earlier levels, and their share gradually declines as the level of proficiency grows.
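The "count each lexeme once, at its first level" bookkeeping described above can be sketched as follows, assuming observations of (lexeme, level) pairs; the data here are toy examples:

```python
# CEFR levels in their proficiency order
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# (lexeme, level) observations, e.g. extracted from a corpus (toy data)
occurrences = [
    ("dag", "A1"), ("dag", "A2"), ("dag", "B1"),
    ("snö", "A1"), ("armé", "C1"), ("armé", "C2"),
]

# Credit each lexeme only to the earliest level at which it occurs
first_level = {}
for lexeme, level in occurrences:
    if lexeme not in first_level or LEVELS.index(level) < LEVELS.index(first_level[lexeme]):
        first_level[lexeme] = level

print(first_level)  # {'dag': 'A1', 'snö': 'A1', 'armé': 'C1'}
```

Counting how many lexemes fall on each first-occurrence level then gives per-level figures of the kind reported in Table 2.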
Figure 2 shows that more than half of the new vocabulary at A1 level consists of lexemes comprising only a root. This percentage drops gradually to 10% at C1 level and to 5% at C2 level. The same tendency can be seen in the receptive corpus, where most of the simplex root lexemes appear at A1 level (36%), and the proportion gradually drops to 7% at C1 level.

An interesting question is whether simplex root lexemes within their respective word families tend to precede items that are more complex in terms of word formation (derivations and compounds). That is, whether learners first get acquainted with, e.g., the simplex root lexeme dag, and then learn its derivations (e.g. daglig, dagis) and compounds (e.g. måndag, vårdag). We have looked into several word families to examine this assumption.

2.2 Case study 2: dag-family

The dag-family (‘day’) is one of the most numerous word families in the Swedish Word Family resource, with a total of 101 members in the Coctaill corpus, 32 of which are also represented in the SweLL-pilot learner essays. As hypothesized, the simplex item consisting of only the root, dag, is introduced before other items at A1 level, together with some derivatives and compounds, namely days of the week (lördag, fredag, etc.) and some of the words describing everyday routines, such as middag (‘dinner’), dagis (‘kindergarten’), as well as parts of the day, eftermiddag (‘afternoon’), see Figure 3. It is obvious from the word cloud in Figure 3 that the root lexeme dag is by far the most frequent in the dag-family at the A1 level, judging by its relative size. The dag-family grows through numerous complex patterns of compounding and derivation, with up to five roots within one item, e.g. här-om-dag-en (3 roots; ‘the other day’), föd-else-dag-s-pre-sent (3 roots; ‘birthday present’), sön-dag-s-efter-mid-dag (5 roots; ‘Sunday afternoon’).
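Root counts like those above (e.g. sön-dag-s-efter-mid-dag, 5 roots) can be derived from a morpheme segmentation once each morpheme carries a type label, as in CoDeRooMor-style annotation; the tag names below are our assumptions for illustration:

```python
def count_roots(segmentation):
    """segmentation: list of (morpheme, type) pairs; count the 'root' morphemes."""
    return sum(1 for _, mtype in segmentation if mtype == "root")

# sön-dag-s-efter-mid-dag with assumed type labels: 's' is a linking element
item = [("sön", "root"), ("dag", "root"), ("s", "interfix"),
        ("efter", "root"), ("mid", "root"), ("dag", "root")]
print(count_roots(item))  # 5
```

The same pass over a whole resource would yield the per-item root counts used to describe family growth through compounding and derivation.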
An interesting fact is that most family members are nouns, with only a few adjectives (daglig, gammaldags), adverbs (dagligen, häromdagen) and proper names, both of which designate newspapers (Dagens nyheter, Svenska Dagbladet); and there are no verbs apart from the multi-word expression sova middag (‘have an afternoon nap’). One more interesting observation is that the most radical expansion of the family happens at A2 and B1 levels. The number of new items gradually decreases after these two levels, with as few as seven new items at C1 level. This is most probably due to the “topical” nature of the word dag and the fact that daily routines have already been well covered at earlier levels.

2.3 Case study 3: språk-family

A predictable pattern of “easy first” can also be traced in the språk-family, which has 62 family members, 57 of them appearing in the Coctaill corpus (i.e. in coursebook texts). The root lexeme språk appears first, at A1 level in both corpora, and is the only representative of the family at that level (see center of Figure 4). We can assume that its presence in the texts has a priming effect on learners, making it possible to combine that root with a number of other roots and affixes at the next levels.

Figure 2: Simplex root lexemes in receptive and productive corpora.

Figure 3: Word cloud of the dag-family.

There are a few distinct patterns that we observe in the språk-family development across levels:

• (numeral/adjective root) + språk + adjectival suffix -ig, e.g. enspråkig (‘monolingual’), frispråkig (‘free-spoken’), flerspråkig (‘multilingual’) → in turn leading to derivations with the suffix -het at the advanced levels, e.g. flerspråkighet (‘multilingualism’).
• compound nouns ending in språk that describe various types of languages, e.g. talspråk (‘spoken language’), riksspråk (‘state language’), yrkesspråk (‘professional language’).
The left-hand element is usually a noun used in attributive function. This seems to be one of the most productive patterns of word formation within this word family.

• språk used attributively to characterize nouns used as the right-hand element in compounds, e.g. språkkurs (‘language course’), språkfamilj (‘language family’), språkpolitik (‘language policy’). This pattern of word formation is highly productive, making the word family expand vastly by C1 level, as shown in Figure 4.

The tendencies observed in coursebooks echo analyses of word-formation networks in Czech and other languages, a representation of one of which is shown in Figure 5. While Lango et al. (2021) aimed at a general language description and analysis, we see similar patterns in the learner language. The hypothesis that more complex patterns follow easier patterns requires, however, a more thorough examination of a far larger number of word families than we provide here, correlating those with each other and with derivational families (i.e. families in a broader sense, where lexical items are centered around a shared affix, infix or other morpheme type).

Figure 4: Distribution of the språk-family in the Coctaill (receptive) data.

Figure 5: A word-formation network for Czech. A reprint from Lango et al. (2021).

All in all, the above analysis of the språk-family demonstrates a clear case where quite a few word-formation patterns give rise to numerous new lexical items. An interesting fact is that most of the språk-family members have been used very infrequently, with only a few exceptions, such as flerspråkig, hemspråk and teckenspråk, all of which appear at C1 level. And, of course, the core item itself, språk, dominates at all levels, not only at the level of first occurrence, which is probably easily explained by the coursebook orientation. The predominant word-formation mechanism is compounding.
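The two compounding patterns described for the språk-family (the shared root as right-hand head, as in talspråk, versus as left-hand modifier, as in språkkurs) can be distinguished with a rough positional heuristic; plain string matching stands in here for a proper morphological segmentation:

```python
def root_position(compound, root):
    """Classify where the shared root sits in a compound (string heuristic)."""
    if compound == root:
        return "simplex"
    if compound.endswith(root):
        return "head"       # e.g. tal + språk
    if compound.startswith(root):
        return "modifier"   # e.g. språk + kurs
    return "internal"

members = ["språk", "talspråk", "riksspråk", "språkkurs", "språkpolitik"]
print({m: root_position(m, "språk") for m in members})
# {'språk': 'simplex', 'talspråk': 'head', 'riksspråk': 'head',
#  'språkkurs': 'modifier', 'språkpolitik': 'modifier'}
```

A string heuristic of this kind misses linking elements and internal roots (e.g. the u in varuhus), which is why the resource relies on manual morphological analysis rather than surface matching.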
2.4 Case study 4: lyx-family

Most families we have looked into seem to follow the path outlined above, i.e. introducing simpler words before more complex ones. However, an important step when examining a hypothesis is to find counterexamples that could trigger new insights. One such counterexample is represented by the lyx-family. It contains only 5 members, distributed in the data as shown in Table 3.

| B1 | B2 | C1
Coctaill | lyxvara (‘luxury product’) | lyx (‘luxury’); lyxa (‘to afford, indulge’); lyxig (‘luxurious’) | —
SweLL | — | — | lyxliv (‘luxury life’)

Table 3: lyx-family

A more intuitive order of introduction, following the principle of ‘easy first’, would be: lyx → lyxa, lyxig, lyxliv, lyxvara. However, we can see that the order of appearance of the words in the lyx-family is counter-intuitive: the compound lyxvara (‘luxury item’) appears before the simplex root item lyx (‘luxury’) and its derivatives lyxig (‘luxurious’) and lyxa (‘to indulge, to treat yourself to luxury’). Several explanations come to mind (apart from the obvious one that natural languages are idiosyncratic and do not tend to adhere to rules). The first is connected to the reasoning about ‘what makes an item easy’. Up to now, we have operated under the assumption that the simplest morphological structure is the main characteristic of an easy item. This is, of course, a simplified view of reality. To complicate it, semantics could be another constituent of the simplicity equation that needs to be considered, in this case the dichotomy between the concreteness and the abstractness of the words in the lyx-family. All of the lyx-items at B2 level have an abstract meaning, whereas the B1 item lyxvara is concrete. From the cognitive point of view, it may be easier to acquire a concrete item (lyxvara) than the other ones. The second likely explanation could be the priming effect of the second family that lyxvara belongs to, namely the var-family.
If we examine how vara (‘product, item’) is used in the Coctaill texts, we will see that up to B1 ten (10) family members are introduced, as shown in Table 4, and all of them are compounds except the root lexeme itself (vara, ‘product, item’). Shopping seems to be one of the central topics in texts at B1 level, since various types of products are introduced. The word-formation pattern is very similar in five of the six items containing the root var at B1 level: ‘a modifier describing the type of product + vara’; lyxvara falls into this pattern, and becomes the first member of the lyx-family to be introduced to language learners. It is possible to speak about a priming effect of statistically recurrent orthographic chunks (in this case ‘modifier + vara’), which, after repeated appearances, start to distinguish themselves as separate morphemes (i.e. vara as a separate item, distinct from a range of modifiers; gradually, lyx becomes recognized as an independent lexical item).

Interestingly, even in the case of the var-family, we see that the first item introduced at A1 level is not the root item vara, but the compound varuhus (‘store’, literally ‘house with goods’), see Table 4. Here it would make sense to check the pattern of introduction of the hus-family members, and there is good reason to believe that this would lead to another quest and become a never-ending story. But we can also see that the question of what is easier for a learner would be interesting to examine by testing learners on high-frequency roots, compounds and two-morpheme derivations.
Besides, the topical focus of texts influences which items are introduced at which levels, which is a predictable consequence of sequencing language education: learners first need to learn how to introduce themselves and attend to their immediate needs, and gradually lift their attention to the world around them and to topics that are no longer centered on learners (the hus-family and the var-family are two clear examples of how central topics change over the proficiency levels in relation to learner needs), as we also see in the CEFR can-do statements (Council of Europe, 2018).

Finally, there is hypothetically a good reason why most word families do not include compounds: compounds add factors that are difficult to account for, such as exposure to the other families and the influence thereof, and since word families are often used in relation to frequency bands, compounds would be complicated to include in the total frequency counts, since they can combine a high-frequency and a low-frequency word family.

| A1 | A2 | B1
Coctaill | varuhus (‘store’) | vara (‘product, item’); möbelvaruhus (‘furniture shop’) | matvara (‘food item’); märkesvara (‘branded item’); råvara (‘raw product’); bytesvara (‘return item’); lyxvara (‘luxury item’); matvaruhus (‘food store’)
SweLL | — | — | —

Table 4: var(a)-family

If we disregard the compounds in the lyx-family, the pattern becomes ‘simplex first’, with all words in the family appearing at the same level as the morphologically simplest one among them (see Table 3, B2 level). Compounds are debated in cognitive research on morphology, some studies positing that compounds are processed as whole-word units, whereas others show evidence that access to constituent morphemes prior to the whole compound word ensures ease of mental access to the item (see the review of studies in Leminen et al. (2019)).
Regardless, the suggestion to remove compounds from the analysis of word families is not viable for Swedish, since compounding is the most widespread word-formation mechanism (cf. Svensson (2022)) and many new roots/words are initially learnt from compounds, like vara from varuhus and lyx from lyxvara. In fact, even placenames, which are often compounds, have been shown to help L2 Swedish learners learn new roots, such as torg (‘square’) from placenames like Rådmanstorget (Löfdahl et al., 2015). To conclude, we have traced the order of learning the word lyx as follows: hus (A1) → varuhus (A1) → vara (A2) → lyxvara (B1) → lyx (B2)

3 Conclusion and future work

We have shown that, using a descriptive resource like Swedish Word Family, it is possible to research linguistic questions and to study language learning paths. However, there are many more application scenarios, for example, utilizing SweWF as a graded resource for Intelligent Computer-Assisted Language Learning, and in particular for the area of Computer-Adaptive Language Testing, e.g. drawing on derivation patterns that are typical at later levels of development to coin non-existent words for word knowledge testing items; and many, many others.

Acknowledgements

This work has been supported by a research grant from the Swedish Riksbankens Jubileumsfond, Development of lexical and grammatical competences in immigrant Swedish, P17-0716:1, and by Nationella språkbanken and HUMINFRA, both funded by the Swedish Research Council (2018–2024, contract 2017-00626; 2022–2024, contract 2021-00176) and their participating partner institutions.

References

R Harald Baayen, Richard Piepenbrock, & Leon Gulikers. 1996. The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania.
Laurie Bauer & Paul Nation. 1993. Word families. International Journal of Lexicography, 6(4):253–279.
Elfrieda H Hiebert, Amanda P Goodwin, & Gina N Cervetti. 2018.
Core vocabulary: Its morphological content and presence in exemplar texts. Reading Research Quarterly, 53(1):29–49.
Lívia Körtvélyessy, Alexandra Bagasheva, & Pavol Štekauer. 2020. Derivational networks across languages. De Gruyter Mouton.
Mateusz Lango, Zdenek Zabokrtsky, & Magda Sevcikova. 2021. Semi-automatic construction of word-formation networks. Language Resources and Evaluation, 55, 03.
Alina Leminen, Eva Smolka, Jon A Dunabeitia, & Christos Pliatsikas. 2019. Morphological processing in the brain: The good (inflection), the bad (derivation) and the ugly (compounding). Cortex, 116:4–44.
M. Löfdahl, S. Tingsell, & L. Wenner. 2015. Lexikon, onomastikon och flerspråkighet [= Lexicon, onomasticon and multilingualism]. In E. Aldrin, M. L. Gustafsson, M. Löfdahl, & L. Wenner, editors, Innovationer i namn och namnmönster. NORNA-förlaget.
Alexandre Nikolaev, Sameer Ashaie, Merja Hallikainen, Tuomo Hänninen, Eve Higby, JungMoon Hyun, Minna Lehtonen, & Hilkka Soininen. 2019. Effects of morphological family on word recognition in normal aging, mild cognitive impairment, and Alzheimer’s disease. Cortex, 116:91–103.
Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Companion volume with new descriptors.
Anders Svensson. 2022. Tre av fyra nyord är substantiv [= Three out of four neologisms are nouns]. Språktidningen, 2 Jan. 2022.
Elena Volodina, Yousuf Ali Mohammed, & Therese Lindström Tiedemann. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 178–189, Reykjavik (Online). Linköping University Electronic Press.
Britta Zeller, Jan Šnajder, & Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German.
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1201–1211.

Annotating the Narrative: A Plot of Scenes, Events, Characters and Other Intriguing Elements

Mats Wirén (1), Adam Ek (2) and Murathan Kurfalı (1)
(1) Department of Linguistics, Stockholm University, mats.wiren@ling.su.se, murathan.kurfali@ling.su.se
(2) Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, adam.ek@gu.se

Abstract

Analysis of narrative structure in prose fiction is a field which is gaining increased attention in NLP, and which potentially has many interesting and more far-reaching applications. This paper provides a summary and motivation of two different but interrelated strands of work that we have carried out in this field during the last years: on the one hand, principles and guidelines for annotation, and on the other, methods for automatic annotation.

1 Introduction

This paper summarizes and motivates recent work of ours on analysing aspects of narrative structure in prose fiction. If you (the reader) are Lars Borin, we hope that you will find this exposition interesting. We think that it fits best with your interests in digital humanities and digital language infrastructure (Borin et al., 2017), to which it is meant as a contribution. If you are not Lars Borin, we still hope that you will find this exposition interesting. In particular, if we manage to convince you that analysis of narrative structure is an excellent testbed not just for people interested in literature but for NLP in general, we will be happy. In terms of methodology, our work is rooted in linguistics and NLP, but we have also adopted relevant concepts from narratology, the field in which narratives in all their forms are studied.
A central notion here is the distinction between three layers of abstraction (for which the terminology differs): a narrative text as told by a narrator (and which is what the reader sees); the story, which corresponds to the chronological and causal sequence of events in the fictional world; and the discourse, which is the sequential organization of the story by the narrator (with events typically presented in non-chronological order). These three layers can be associated with a “who?”, a “what?” and a “how?”, respectively, and the question “Who tells what, and how?” can thus be taken as an outline of the goal of narrative analysis (Jahn, 2021, Figure 2, page 17). Incidentally, this is a much harder question than the one typically associated with traditional NLP, “Who did what to whom?”, something that we will return to below. We have approached the analysis of narrative structure in two ways: First, by developing a series of annotation guidelines and using them for successive manual annotations; and secondly, by developing methods for automatic analysis based on our own annotations as well as those of others. In the following, we will give an account of these two strands of work, and will end by discussing why we are doing this.

2 Annotation guidelines

Annotation of narrative phenomena in prose fiction (whether manual or automatic) is relatively recent in computational linguistics; among the first approaches is Elson et al. (2010). In contemporary narratology, annotation barely plays any role at all (Reiter et al., 2019a, page 17). However, a recent event which spurred interest in this area was the Shared Task on Systematic Analysis of Narrative Texts through Annotation (SANTA), first announced in 2017 and organized in Germany (Reiter et al., 2019b; Reiter et al., 2019a). The format of SANTA emulated a shared task in NLP, but the difference was that what was compared was not systems but annotation guidelines.
More specifically, the task in SANTA was to produce annotation guidelines for narrative levels, a term introduced by Genette (1983), meaning roughly subordination in the sense of stories in stories, or phenomena such as direct speech between characters which can be thought of as a form of embedding (of quotations) relative to the narrator. The organizers provided a selection of theoretical material as a background, but did not recommend any particular narratological approach; rather, the participants were given a free hand in terms of theoretical assumptions.

Mats Wirén, Adam Ek and Murathan Kurfalı. 2022. Annotating the narrative: A plot of scenes, events, characters and other intriguing elements. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 161–166. Available under CC BY 4.0.

Annotation of narrative structure is much like annotation in NLP and should fulfil the same basic objectives (or so we will assume). In particular, it should capture the crucial aspects of the phenomena targeted, and be simple enough both for annotators to perform and for machines to learn. However, it has two distinctive characteristics. First, narrative categories may cover very large portions of the text, in contrast to syntactic annotation and other annotation types, which are typically constrained to the sentence level. Furthermore, the context relevant for determining a category can also be very large. In these respects, it is similar to co-reference annotation, which is generally seen as a document-level task. Secondly, annotators typically need to decide not just on the categories themselves but also on the spans of these categories. This is reminiscent of named entities, for which annotators first need to identify the segment a name spans and then decide on its category.1 Guidelines for the first round of SANTA were submitted in June 2018. Eight groups from Germany, Ireland, Canada, the U.S.
and Sweden (us) made submissions. The participation was broad, including computer scientists, computational linguists and literary scholars. In September 2018, the groups convened for a workshop in Hamburg where each of the guidelines was presented, discussed and ultimately ranked. To this end, the guidelines were evaluated both qualitatively and quantitatively. The former included conceptual coverage (the extent of the theoretical underpinnings), applicability (usability for annotators), and usefulness (roughly, helpfulness for the purpose of understanding the narrative structure). The quantitative measure was inter-annotator agreement. This was based on parallel annotations made prior to the workshop, with participants applying their own guideline and someone else’s guideline, and a group of students (supervised by the organizers) annotating according to everyone’s guidelines. The guidelines, expert reviews of these administered by the organizers, and the evaluation results were published in a special issue of Journal of Cultural Analytics (Gius et al., 2019; Willand et al., 2019); our guideline is included as Wirén et al. (2019). Development of annotation guidelines is an inherently iterative process. To take advantage of the insights gathered in the first round, the organizers decided to do a second round in which the groups were invited to submit revised guidelines, in May 2019. Although it was not possible to hold a second workshop, the organizers administered new parallel annotations according to the revised guidelines, and made a quantitative evaluation of these. The results and final conclusions were published in another special issue of Journal of Cultural Analytics (Gius et al., 2021). Overall inter-annotator agreement improved between rounds 1 and 2. Whereas the best score in round 1 was 0.30 (ours: 0.23), the best score in round 2 was 0.46 (which was our guideline).
These scores were measured using the γ (gamma) metric (Mathet et al., 2015), where 1 means no disagreements and 0 corresponds to random agreement. The γ metric was chosen since it handles agreement with respect to both categories and spans, as mentioned above. In summary, the 0.46 score is not a great result, but it was considered acceptable since this was a new task, and it was at least substantially higher than the best agreement in the first round.

A breakdown of our annotation scheme is shown in Table 1; it is further described in Wirén & Ek (2021). The tagset is hierarchically structured in four layers, ordered by an inclusion relation. The scheme covers voice, that is, whether the narrator is ever present in the story or not (Genette, 1983, Chapter 5); focalization, how much information the narrator has access to (Genette, 1983, page 189 ff.); and identification of passages told by the narrator and passages containing the characters’ direct speech, respectively. In addition, the scheme allows for embeddings among these latter two kinds of passages (stories in stories, etc.).

Table 1: Hierarchical structure of the annotation scheme.
Layer    Description
1        Narrator’s presence in the story
2        Perspective of the narrator
3        Narrator’s discourse
4        Characters’ discourse
4.1      Turn = one or several lines with the same speaker
4.1.1    Line = a stretch with the same speaker and the same addressee, and tagged with these
4.1.1.1  Speech-framing construction

Our annotation of fictional dialogue is relatively fine-grained. We distinguish between turns and lines, and annotate the identities of speakers and addressees. We also annotate speech-framing constructions, which provide the narrator’s cues about the circumstances of the speech as opposed to the speech itself (somewhat related to the notion of speech-framing expressions in Caballero & Paradis (2017)).

1 It might be argued that, in distinction to NLP annotation, the graphic formatting of printed prose fiction would tell us something about narrative structure; for example, turn changes among speakers in fictional dialogue seem regularly to be accompanied by new paragraphs. However, conventions for this vary wildly, and a basic assumption in SANTA is that narrative annotation should be based solely on the contents of the text, and not on formatting devices such as text structure in TEI XML or paragraphs or chapters in a printed book (Reiter et al., 2019a, page 15).

3 Automatic annotation

The SANTA shared task had a twofold aim: increasing the understanding of narrative levels, and generating annotated data for the purpose of machine learning. (The plan is to have a second, follow-up shared task which will be devoted to automatic annotation of narrative levels based on annotations from SANTA.) However, even before SANTA we began to explore automatic annotation of narrative structure. Our interest in this began with direct speech, an independent narrative mode which has an interesting “double orientation” (Koivisto & Nykänen, 2016): it can be understood both in relation to our experiences of real-life conversations (by mimicking aspects of this) and in relation to the fictional world. In the first system that we constructed, Ek et al. (2018), our goal was to keep track of the identities of speakers and addressees, as this is one key aspect of the structure of a story. To this end, we used an averaged perceptron with handcrafted features for both speakers and addressees. The system relied on three contexts: (i) the passage of direct speech in which we want to identify the speakers and addressees; (ii) the narrative passage immediately preceding this; and (iii) a global context consisting of all dialogue and narration preceding these.
We used information about the frequency with which different characters occurred in all of these, as well as mentions with speech verbs, such as “. . . said Adam”. We compared our system against three baselines and found that it outperformed all of them. Another finding was that the system was better at predicting speakers than addressees. In a spin-off from this work, Ek & Wirén (2019) were interested in identifying speech-framing constructions as mentioned in Section 2, in effect elements of narration appearing inside passages of dialogue. For example, in “Machine learning is fun, said Adam, watching the audience gleefully”, the part “said Adam, watching the audience gleefully” constitutes the narrator’s cues about the circumstances of the speech (including a speech tag indicating the identity of the speaker). Our model used logistic regression with handcrafted features, mainly various syntactic cues. Since this was a new task, there were no previous systems to compare with, but the system outperformed three baselines by a large margin. Kurfalı & Wirén (2020) described an attempt towards a generalized solution to this task, applicable to a multitude of languages. Operating under a low-resource assumption of complete absence of manually labelled data, we first devised a set of heuristics to identify the cases in a large book corpus where quotation marks are used sufficiently consistently to automatically elicit enough training data. Then, as for the classifier, we replaced the linguistic features, which we cannot assume to be available across languages, with multilingual contextual embeddings to arrive at a feature-independent multilingual model. The results, obtained on manually annotated datasets in four different languages (including Swedish, with data provided by Stymne & Östman (2020)), suggest that the proposed methodology achieves results comparable to the supervised monolingual baselines.
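The general shape of an averaged perceptron with sparse handcrafted features can be illustrated with a minimal sketch. This is not the Ek et al. (2018) implementation: the class, the feature names (speech-verb attribution, mention frequencies) and the toy data below are invented stand-ins for the kinds of features the text describes.

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal multi-class averaged perceptron over sparse feature dicts."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.weights = defaultdict(float)  # (feature, class) -> weight
        self.totals = defaultdict(float)   # accumulated weights, for averaging

    def score(self, feats, c):
        return sum(v * self.weights[(f, c)] for f, v in feats.items())

    def predict(self, feats):
        # Ties resolve to the first class in self.classes
        return max(self.classes, key=lambda c: self.score(feats, c))

    def fit(self, data, epochs=5):
        steps = 0
        for _ in range(epochs):
            for feats, gold in data:
                steps += 1
                pred = self.predict(feats)
                if pred != gold:
                    for f, v in feats.items():
                        self.weights[(f, gold)] += v
                        self.weights[(f, pred)] -= v
                # Accumulate the full weight vector after every example
                for key in list(self.weights):
                    self.totals[key] += self.weights[key]
        # Replace final weights with their running averages
        for key in list(self.weights):
            self.weights[key] = self.totals[key] / steps

# Hypothetical features for "who speaks in this passage?":
train = [
    ({"speech_verb:Adam": 1.0, "global_freq:Adam": 2.0}, "Adam"),
    ({"speech_verb:Eve": 1.0, "global_freq:Eve": 1.0}, "Eve"),
]
model = AveragedPerceptron(["Adam", "Eve"])
model.fit(train)
```

Averaging the weights over all updates is the standard stabilizing trick of this classifier family; a real system would of course extract the features from the three contexts described above rather than from hand-written dictionaries.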
A different kind of problem that we have worked on is segmentation of scenes. This concept was introduced into narratology by Genette (1983), where the basic idea was that in a scene, the amounts of time in the story and the discourse are proportional to each other. Put differently, the narration in a scene should be roughly chronological with a uniform pace, without time leaps, flashbacks or other temporal discontinuities. Scenes are basic building blocks of a narrative discourse, and are in turn composed of events, considered to be the smallest spatiotemporal units. Kurfalı & Wirén (2021) explored this problem through participation in the Shared Task on Scene Segmentation (STSS) (Zehe et al., 2021). To obtain an operational definition of a scene, the organizers had adopted a more general definition than the above, which also involved the set of characters (which should be stable), space (largely unchanged) and action (coherent and continuous). An extensive annotation guideline had been developed to make these notions more precise, and a corpus of 15 dime novels in German had been annotated. In addition to scenes, there are also non-scenes, typically in the form of summaries where the progress of time is compressed, though sequences of non-scenes were not targeted here. The task of scene segmentation thus consisted in dividing a text into scenes and non-scenes and labelling them accordingly. An operationalization of this, which we adopted, is to identify and label scene transitions of three types: SCENE–SCENE, SCENE–NONSCENE and NONSCENE–SCENE. We modelled scene segmentation as a sequence classification task, similar to named-entity recognition or part-of-speech tagging, with the difference that sentences rather than tokens are the target linguistic units. Following Cohan et al. (2019), a sequence of N sentences was concatenated using BERT’s special delimiter token [SEP], which was used for representing the sentence it preceded.
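The input construction in this scheme can be sketched as follows. This is a schematic illustration, not the authors' code: in a real system the tokens would be converted to ids, fed through a pretrained encoder, and a classifier over each [SEP] vector would emit a per-sentence label (e.g. one of the transition types above, or none).

```python
def build_window(sentences, cls_token="[CLS]", sep_token="[SEP]"):
    """Concatenate a window of sentences into one token sequence, inserting a
    [SEP] before each sentence. The index of that [SEP] is recorded so its
    contextual embedding can later stand in for the sentence it precedes."""
    tokens = [cls_token]
    sep_positions = []
    for sentence in sentences:
        sep_positions.append(len(tokens))
        tokens.append(sep_token)
        tokens.extend(sentence.split())  # placeholder for real subword tokenization
    return tokens, sep_positions

tokens, sep_positions = build_window(["He opened the door .", "Years passed ."])
```

The sentence-level framing is what makes the task look like tagging: one prediction per [SEP] position instead of one per token.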
The best F1 score obtained in the shared task was 0.37 (which was our system; the second-best system had 0.16). Our system was comparatively successful in detecting scene-to-scene transitions, yet it completely failed to discover the boundaries between non-scenes and scenes. Overall, the result was not great, but it was expected that this new task would be difficult. Also, the F1 score is very strict in the sense that it counts a scene boundary as correct only if it is predicted at exactly the right position, whereas an offset of even a single sentence is counted as a complete miss.

4 Discussion

We have found it very fruitful to engage in work (especially the shared tasks) on both guidelines and methods for annotation of narrative discourse, as we have been able to transfer ideas from the one to the other. For example, our detailed annotation for direct speech has inspired our work on methods for automating this kind of analysis. In general, we are motivated both by an interest in making narratological concepts more precise and by the applications of this kind of analysis, whether in literary studies or in other areas where story-telling may play a role (journalism, history, courtroom interactions, to take just a few examples). It was suggested in Section 1 that the question “Who tells what, and how?” is much harder to answer than “Who did what to whom?”. This is in fact also an important reason why we are pursuing this line of work: we believe that analysis of narrative structure provides an excellent testbed for state-of-the-art NLP technology. The Shared Task on Scene Segmentation was a demonstration of this: although our system outperformed the other systems, the overall results were not impressive, thus giving a sense of the big challenges posed by problems like this. Scene segmentation is a crucial task since it concerns the basic building blocks of a narrative discourse.
However, out of the tasks that we have explored so far, we believe that it is the one with the largest room for improvement, as suggested by the gap between the current state of the art and human performance. Despite several different architectures based on language models, current systems seem to fall short of achieving acceptable performance unaided. One possible source of aid can come from the task of event detection, on the assumption that scenes and non-scenes consist of different event types (for example, non-scenes can be expected to contain more state-like events).

References

Lars Borin, Nina Tahmasebi, Elena Volodina, Stefan Ekman, Caspar Jordan, Jon Viklund, Beáta Megyesi, Jesper Näsman, Anne Palmér, Mats Wirén, Kristina Björkenstam, Gintarė Grigonytė, Sofia Gustafson Capková, & Tomasz Kosiski. 2017. Swe-Clarin: Language resources and technology for Digital Humanities. In Digital Humanities 2016. Extended Papers of the International Symposium on Digital Humanities (DH 2016), Växjö. Edited by Koraljka Golub, Marcelo Milra. Vol-2021, Aachen. M. Jeusfeld c/o Redaktion Sun SITE, Informatik V.
Rosario Caballero & Carita Paradis. 2017. Verbs in speech framing expressions: Comparing English and Spanish. Journal of Linguistics, 54(1):45–84.
Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, & Daniel S Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3693–3699.
Adam Ek & Mats Wirén. 2019. Distinguishing narration and speech in prose fiction dialogues. In Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, pages 124–132, Copenhagen.
Adam Ek, Mats Wirén, Robert Östling, Kristina Nilsson Björkenstam, Gintarė Grigonytė, & Sofia Gustafson-Capková. 2018. Identifying speakers and addressees in dialogues extracted from literary fiction.
In Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC), Miyazaki.
David Elson, Nicholas Dames, & Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138–147, Uppsala. Association for Computational Linguistics.
Gérard Genette. 1983. Narrative Discourse: An Essay in Method. Cornell University Press.
Evelyn Gius, Nils Reiter, & Marcus Willand. 2019. A Shared Task for the Digital Humanities. Chapter 2: Evaluating Annotation Guidelines. Journal of Cultural Analytics, 4(3):1–11.
Evelyn Gius, Marcus Willand, & Nils Reiter. 2021. On Organizing a Shared Task for the Digital Humanities — Conclusions and Future Paths. Journal of Cultural Analytics, 6(4):1–28.
Manfred Jahn. 2021. Narratology 2.3: A Guide to the Theory of Narrative. English Department, University of Cologne, Cologne.
Aino Koivisto & Elise Nykänen. 2016. Introduction: Approaches to fictional dialogue. International Journal of Literary Linguistics, 5.
Murathan Kurfalı & Mats Wirén. 2020. Zero-shot cross-lingual identification of direct speech using distant supervision. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 105–111, Online. International Committee on Computational Linguistics.
Murathan Kurfalı & Mats Wirén. 2021. Breaking the narrative: Scene segmentation through sequential sentence classification. In Proceedings of the Shared Task on Scene Segmentation, co-located with the 17th Conference on Natural Language Processing (KONVENS), volume 3001, http://ceur-ws.org/Vol-3001/, pages 49–53, Düsseldorf.
Yann Mathet, Antoine Widlöcher, & Jean-Philippe Métivier. 2015. The unified and holistic method gamma (γ) for inter-annotator agreement measure and alignment. Computational Linguistics, 41(3):437–479.
Nils Reiter, Marcus Willand, & Evelyn Gius.
2019a. A Shared Task for the Digital Humanities. Chapter 1: Introduction to Annotation, Narrative Levels and Shared Tasks. Journal of Cultural Analytics, 4(3):1–24.
Nils Reiter, Marcus Willand, & Evelyn Gius. 2019b. Foreword to the Special Issue "A Shared Task for the Digital Humanities: Annotating Narrative Levels". Journal of Cultural Analytics, 4(3):1–4.
Sara Stymne & Carin Östman. 2020. SLäNDa: An annotated corpus of narrative and dialogue in Swedish literary fiction. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pages 826–834, Marseille. European Language Resources Association.
Marcus Willand, Evelyn Gius, & Nils Reiter. 2019. A Shared Task for the Digital Humanities. Chapter 3: Description of Submitted Guidelines and Final Evaluation Results. Journal of Cultural Analytics, 4(3):1–15.
Mats Wirén & Adam Ek. 2021. Annotation Guideline No. 7 (revised): Guidelines for annotation of narrative structure. Journal of Cultural Analytics, 6(4):164–186.
Mats Wirén, Adam Ek, & Anna Kasaty. 2019. Annotation Guideline No. 7: Guidelines for annotation of narrative structure. Journal of Cultural Analytics, 4(3):1–22.
Albin Zehe, Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, et al. 2021. Detecting scenes in fiction: A new segmentation task. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3167–3177.

The other SAT-Solver: Applying lexicons to SweSAT word questions

Niklas Zechner
Språkbanken Text, Gothenburg University
niklas.zechner@gu.se

Abstract

SweSAT synonyms is a collection of questions from the Swedish SAT (högskoleprovet), and part of the SuperLim suite of machine learning datasets. Each question consists of one word or short phrase, and five possible explanations, each of which is also a word or short phrase.
In this study, two different lexicons are applied to try to answer these questions. The first is Bring’s Thesaurus, in a version partly updated from the 1930 original. The second is SALDO, a Swedish association lexicon. We find that although coverage is limited by the presence of multi-word expressions, each is reasonably accurate for words where a match is found in the lexicon, and by combining them we get an overall accuracy of 47% when counting all words.

1 Introduction

Understanding synonyms is an important part of natural language understanding. As an example of this, SuperLim provides questions from the word section of the Swedish SAT, where the task is to identify which of five words or expressions is synonymous with one given word or expression. We attempt to do this by using two different lexicons, with different types of associations between words.

2 Data

2.1 SuperLim

The SuperLim suite (Adesam et al., 2020) consists of (so far) 13 datasets for evaluation of language understanding models in Swedish. One of those is SweSAT Synonyms, based on questions from the Swedish SAT (högskoleprovet). For each question, there is one focus word, and five alternatives to choose from. The focus word and the alternatives can be single words or short expressions, and the task is to find the alternative which is a synonym of, or alternative expression for, the focus word. One purpose of SuperLim is to provide a unified test set for several language understanding tasks in Swedish. Since the resource is relatively new, there are few results to compare with yet, and the only attempt we find for the SweSAT task is that of Rekathati (2021).

2.2 Bring

Svenskt ordförråd ordnat i begreppsklasser [= The Swedish vocabulary arranged in conceptual classes] is an adaptation to Swedish of Roget’s Thesaurus, written by S. C. Bring in 1930. In a digitised and modernised version, it is provided by Språkbanken as Blingbring (here used in version 0.3) (Borin et al., 2014).
It contains just over 1000 semantic categories, each with a number of subcategories. The subcategories are noun, verb, or adjective categories. The categories, subcategories, and words are all ordered so that the most similar words should be close together.

Niklas Zechner. 2022. The other SAT-Solver: Applying lexicons to SweSAT word questions. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 167–169. Available under CC BY 4.0.

2.3 Saldo

Saldo (Swedish Associative Thesaurus, here used in version 2.3) (Borin et al., 2013) is an electronic lexical resource for modern Swedish. For each entry, it contains a link to a primary descriptor, a more basic word considered an “ancestor sense” of the word in question. Many words also have one or more secondary descriptors, other words whose associations give a better idea of the semantics of the word in question. Since the descriptors are always words considered more basic, the full set of words forms a tree (using the primary descriptors) or a directed acyclic graph (using all descriptors), leading back to a single pseudo-word root node.

3 Experiment

The SweSAT dataset contains 822 questions, each with five alternatives, only one of which is correct. Random guessing would therefore have an expected accuracy of 20%. Rekathati (2021), which we take as the state of the art, reports an accuracy of 42.82%.

3.1 Bring

For each question, we look up the focus word in Bring and compare it with each of the options. If there is an option word which appears in the same subcategory as the focus word, we choose that. Otherwise, we look at the larger categories. If there is still no match, we choose the option which is closest to the focus word in the entire word list. In cases where more than one option word is found in a category, we pick the first match (so effectively at random).
In some cases, a word appears in more than one category in Bring; we count a match if both words appear together in any category. Out of the 822 focus words, 258 (31%) did not appear in Bring. Many of these words are adjectives, which seem to be underrepresented in Bring, and many are phrases. For 88 words (11%), although the focus word appeared in Bring, none of the option words did. For 243 words (30%), a match was found in a subcategory. Of those, 192 (79%) were correct. For 61 words (7%), a match was found in a large category. Of those, 26 were correct (43%), so still much better than random, and in fact on par with the state of the art. For 172 words (21%), matches were found in Bring, but none within the same large category. In principle, the categories in Bring are arranged so that more similar categories are close, which should mean that we can find the most likely option by looking at the closest word. This gives only 25 correct answers (15%), considerably worse than random. This might be a coincidence, but we can also speculate about a possible explanation: For option words which are found in Bring, words that are actually similar are likely to share a category, so if the word is found but in a different category, that word is less likely to be correct than a word which is not found at all. If we flip the strategy around, and instead choose an absent word over one which is present, we get 46 correct answers (27%). Better than random, but not a particularly reliable method. This means that in total, if we were to give an answer to each question, using the absent-above-distant method, and assuming that a random guess has a 20% chance, we would get 333 correct (41%), just shy of the state of the art. If we only consider questions where there is a match in a subcategory, we get 79% correct while answering 30% of the questions, which is at least a useful accuracy. If we include those with a match in a large category, we get 72% correct out of 37% answered. 
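The selection cascade described above (subcategory match, then large-category match, then the absent-above-distant fallback) can be sketched as follows. The data structures and toy words are invented stand-ins for the Blingbring resource, not its actual contents.

```python
def answer_with_bring(focus, options, subcats, cats, position):
    """Cascade described in the text: (1) shared subcategory, (2) shared large
    category, (3) absent-above-distant fallback. subcats/cats map a word to the
    set of (sub)categories it occurs in; position gives its index in the full
    Bring word list. All of these are invented stand-ins for Blingbring data."""
    if focus not in subcats:
        return None  # focus word not covered by Bring
    for opt in options:
        if subcats[focus] & subcats.get(opt, set()):
            return opt  # first option sharing a subcategory
    for opt in options:
        if cats[focus] & cats.get(opt, set()):
            return opt  # first option sharing a large category
    absent = [opt for opt in options if opt not in position]
    if absent:
        return absent[0]  # prefer options absent from Bring ...
    return min(options, key=lambda opt: abs(position[opt] - position[focus]))  # ... over distant ones

# Toy stand-in data (not real Blingbring content):
subcats = {"glad": {"s1"}, "lycklig": {"s1"}, "ledsen": {"s2"}, "hus": {"s3"}}
cats = {"glad": {"c1"}, "lycklig": {"c1"}, "ledsen": {"c1"}, "hus": {"c2"}}
position = {"glad": 10, "lycklig": 11, "ledsen": 12, "hus": 500}
```

Note that the fallback encodes the flipped strategy motivated above: an option absent from Bring is preferred over one that is present but distant.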
3.2 Saldo

In Saldo, entries can be considered nodes in a graph, where each entry is connected to its descriptors. To measure the distance between two words, we traverse their lists of ancestors (i.e. descriptors, the descriptors' descriptors, etc.) until we find one shared by both words. For each question, we choose the option word that is closest to the focus word. We find 291 correct (35%), 330 incorrect (40%), 96 (12%) where the focus word does not appear in Saldo, and 105 (13%) where the focus word appears in Saldo but none of the options do. This means that if we only consider questions where a match is found, we get 47% correct while answering 76% of the questions. If we guess at random on the unknown questions, that gives an accuracy of 40%.

3.3 Simple tricks

A common piece of advice for multiple-choice questions is to "always guess C". How well would that work here? It turns out that 164 of the answers were option C, which is 20% (as close as we can get with 822 questions). The most common answer in this dataset was A, at 172; the difference from random is not statistically significant.

Another trick is to avoid giving the same answer twice in a row. The dataset contains information on which test each question comes from, so we rule out all first questions and look at how many of the remaining questions have the same correct option as the previous question. The result is 139 of 773 (18%), which may look meaningful, but is not statistically significant.

Judging from personal experience, it seems that the correct answer is often longer than the others. Is this a real effect, or just coincidence and confirmation bias? We try the "longest answer" method and find that it gives 197 correct answers (24%). In this case, the difference from random guessing is statistically significant (p < 5%).

3.4 Combining methods

Since the two lexicons find different words, we can put them together to find a greater number of matches and hopefully a better accuracy.
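Returning briefly to the Saldo method of Section 3.2: the shared-ancestor search can be sketched as a breadth-first walk up the descriptor graph. The paper does not spell out the exact distance definition, so taking the smallest combined number of descriptor steps to a shared ancestor is one plausible reading; `descriptors`, mapping each entry to its list of descriptor entries, is a hypothetical stand-in for the actual Saldo data:

```python
from collections import deque

def ancestors(word, descriptors):
    """Map word and each of its ancestors (descriptors, transitively)
    to its distance from word, via breadth-first search."""
    dist = {word: 0}
    queue = deque([word])
    while queue:
        w = queue.popleft()
        for d in descriptors.get(w, []):
            if d not in dist:
                dist[d] = dist[w] + 1
                queue.append(d)
    return dist

def distance(a, b, descriptors):
    """Smallest combined distance to an ancestor shared by a and b,
    or None if the two entries have no common ancestor."""
    da, db = ancestors(a, descriptors), ancestors(b, descriptors)
    shared = set(da) & set(db)
    return min(da[w] + db[w] for w in shared) if shared else None
```

Because all descriptor chains lead back to the single pseudo-word root node, any two entries present in the lexicon always have some shared ancestor, so the option word minimizing this distance is well defined.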
Because Bring has higher accuracy but lower coverage, it makes sense to start with it, and then apply Saldo to the remaining questions. For those left unanswered by both Bring and Saldo, we use the longest-answer method. Bring, as before, gets 218 correct out of 822, leaving 518 questions. Saldo finds an answer to 321 of those, of which 117 are correct. That leaves 197, where picking the longest option gives 49 correct. This all adds up to 384 correct (47%).

4 Conclusion

When S. C. Bring wrote his thesaurus, he was 88 years old. His work has now surpassed that age, but it is still a very useful resource in tasks like this. Using it, we are not able to find all the words and expressions in the material, but for those we do find, the accuracy is quite encouraging. Saldo, on the other hand, a more modern style of lexicon, is able to find a considerably larger number of words, but with much lower accuracy. It seems likely that other, similar lexicons, perhaps with different types of associations, could reach higher accuracies. By combining the two, we reach an accuracy of 47%, improving on the previously best known result with a fast, lightweight, and transparent method.

References

Yvonne Adesam, Aleksandrs Berdicevskis, & Felix Morger. 2020. SwedishGLUE – towards a Swedish test set for evaluating natural language understanding models. Technical report, University of Gothenburg.

Lars Borin, Markus Forsberg, & Lennart Lönngren. 2013. Saldo: a touch of yin to wordnet's yang. Language Resources and Evaluation, 47(4):1191–1211.

Lars Borin, Jens Allwood, & Gerard de Melo. 2014. Bring vs. MTRoget: Evaluating automatic thesaurus translation. In Proceedings of LREC 2014, May 26–31, 2014, Reykjavik, Iceland. European Language Resources Association.

Faton Rekathati. 2021. Introducing a Swedish sentence transformer. The KBLab Blog, Kungliga biblioteket, August 23.
Mot en mänskligare maskinöversättning
(Towards a more humane machine translation)

Robert Östling
Institutionen för lingvistik
Stockholms universitet
robert@ling.su.se

Abstract

Over the lifetime of Lars Borin, machine translation has made a gigantic leap – from simple rule-based systems residing on vacuum tube computers, to the latest zero-shot translation systems. The amount of text data used by modern systems can reach hundreds of billions of words, but is this really necessary? What is the lower limit on training data for a translation system? Here I suggest a simple experiment, entirely without computers, that could go some way towards answering this question.

1 Introduction

In ancient times, more than three years before the birth of Lars, the first experiment in automatic machine translation was carried out. On a computer the size of a room, a few well-chosen sentences were translated from Russian into English, and computational linguistics was born. That was then. In this year of grace 65 after the birth of Lars, we can feed a mass of text into a computer the size of a room and, a few gigawatt hours later, get out a program that translates – if we ask it in the right way. Some interpret this as the death of computational linguistics, but that is a question we will have to return to on another occasion.

What does it actually take to translate from one language into another? A few hundred billion words of linguistic jumble from the internet is evidently plenty, but surely it should be possible to manage with considerably less? This question can be, and has been, explored in various projects on machine translation for low-resource languages, but the interpretation of negative results poses a problem: even if a particular experiment failed to produce an acceptable translation, how do we know that no other algorithm could have solved the task?

2 Method

In this case it may be easier to work with humans, ideally someone with linguistic training.
The method is simple: give the human a parallel text in her native language and another language, and then ask her to translate sentences from another source between the two languages. If she fails despite time and effort, this is probably because the parallel text is not long and/or varied enough for the task to be solvable. Through such a study we could establish a human baseline for low-resource translation, and my guess is that it would be hard for a computer to do much better.

Unfortunately, I do not have access to the quantity of bored linguists that would be required, but we can approximate translation from the native language into the unknown language with a simpler method. Since everything we know about the target language comes from the parallel text, we are limited to the words and constructions that occur in the text. Under the (somewhat optimistic) assumption that our subject manages to correctly interpret every linguistic detail of the parallel text, the task then reduces to piecing together sentences using only the words and constructions that occur in the parallel text.

Robert Östling. 2022. Mot en mänskligare maskinöversättning. In Volodina, Dannélls, Berdicevskis, Forsberg and Virk (editors), Live and Learn – Festschrift in honor of Lars Borin, pages 171–173. Available under CC BY 4.0

Text | Source | Construction
de förlorade fåren | Matt 10:6 | noun phrase in the plural, definite form with adjective
nyligen ville judarna stena dig | Joh 11:8 | lexicon
under förgångna släktens tider | Ef 3:5 | lexicon
mycket stor glädje | Matt 2:10 | modification of adjective
de fyrtio åren i öknen | Apg 7:42 | lexicon
ständigt, var dag, voro de | Apg 2:46 | lexicon
hört konungens ord | Matt 2:9 | lexicon
skall edert tal vara | Matt 5:37 | lexicon
talen I eder emellan om att | Matt 16:8 | clause introducer for topic of discussion
det bliver klart väder | Matt 16:2 | lexicon
förändra de stadgar | Apg 6:14 | lexicon
kallas | Matt 1:16 | passive suffix
för . . . har han förkortat | Mark 10:20 | verb in the perfect after adverbial (V2)
kom i skarpt ordskifte med dem | Apg 15:2 | lexicon
bland sig utvälja några män | Apg 15:22 | lexicon
de äro blinda ledare | Matt 15:14 | lexicon
österns länder | Matt 2:1 | lexicon
sydväst och nordväst | Apg 27:12 | lexicon
var konung över Judeen | Luk 1:5 | leader of a country

Table 1: Sources of words, phrases and constructions in the New Testament.

We illustrate this using the most frequently translated text of all, the New Testament.1 A tool that produced acceptable translations based solely on this short text would mean that machine translation becomes possible for thousands of new languages. As an example, we pick a sentence at random from today's newspaper:

De senaste åren har klimatfrågan dominerat i de nordiska grannländernas valkampanjer. (Dagens Nyheter, 2022-09-01)

('In recent years, the climate issue has dominated the election campaigns of the neighbouring Nordic countries.')

As modern readers, we can understand and explain what this sentence means, so the next step is to look for words and constructions that can be used to express it. We choose the New Testament of the 1917 Swedish Bible translation as our parallel text, so the task is now to express the meaning of the sentence above using only linguistic material from that text. Table 1 shows the sources in the New Testament that are used. Without searching overly carefully, I have tried to use the first suitable occurrence of each word, phrase, or grammatical construction.

We begin with the noun phrase "de senaste åren" ('the recent years'): definite form, plural, modified by an adjective. The text contains "de förlorade fåren" ('the lost sheep', Matt 10:6), which shows that we can use the same word order and article ("de"). The next step is to look for some lexical material. For "senaste" ('most recent'), the closest I can find is "förgångna" ('past', Ef 3:5), which modifies a noun ("förgångna släktens tider"). The adverb "nyligen" ('recently', Joh 11:8) can modify "förgångna" to come closer to the meaning of "senaste", and this is done (as in Matt 2:10) by placing the adverb directly before the adjective.
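The constraint underlying this exercise, that the translation may use only words attested in the parallel text, can also be checked mechanically. A minimal sketch (my own, not part of the paper; simple word tokenization with lowercasing is an assumption, and it checks only vocabulary, not constructions):

```python
import re

def unseen_words(candidate, parallel_text):
    """Return the words of `candidate` that never occur in the
    parallel text, i.e. material the translator may not use."""
    def tokenize(text):
        # \w matches Unicode word characters in Python 3, so Swedish
        # letters like å, ä, ö are handled correctly.
        return set(re.findall(r"\w+", text.lower()))
    return tokenize(candidate) - tokenize(parallel_text)
```

An empty result means the candidate sentence stays within the vocabulary of the parallel text; any words returned would have to be replaced or paraphrased, as done for "klimatfrågan" below.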
Finally, we have "åren" ('the years'), which occurs in the text as is. We now have the beginning of the translation:

De nyligen förgångna åren . . .

1Incidentally, Lars has seen to it that Gustav Vasa's Bible translation is now available in Språkbanken's vaults. For this, those of us who work with the Bible as a parallel corpus are grateful!

I will not go through the rest of the sentence in the same level of detail; the reader can piece the puzzle together with the help of the table. A few comments are needed, however, on concepts that were unknown in biblical times. "Klimatfrågan" ('the climate issue') is a good example. Here we obviously cannot find a suitable lexeme in the New Testament, but a translator is of course free to paraphrase passages that are hard to translate. I have chosen to render "har klimatfrågan dominerat" ('the climate issue has dominated') as "har vi ständigt hört tal om att väder förändras" ('we have constantly heard talk of the weather changing'). In principle, a machine translation system could do the same, although this places high demands on world knowledge and language generation ability.

Another hard nut to crack is how "valkampanjer" ('election campaigns') can be expressed entirely without the vocabulary of modern representative democracy. The closest I could find was the not entirely precise and somewhat archaic "ordskiftet om att utvälja ledare" ('the dispute about choosing leaders'). The word "ledare" ('leader') itself occurs only in the more literal sense of a blind leader leading another blind person into a pit (Matt 15:14). Even if that parable may sometimes seem fitting during election campaigns, another translator might have preferred "konungar" ('kings', Matt 10:18) or "härskare" ('rulers', Luk 22:25).

The end result is as follows, and I leave it to the reader to judge how well it conveys the meaning of the original sentence:

De nyligen förgångna åren har vi ständigt hört tal om att väder förändras, i ordskiftet om att utvälja ledare över nordens länder.

To return to the question of how much training material is required, we can look at the Source column of Table 1.
Most of the material already occurs in the first book, the Gospel of Matthew (Matt), and almost all of it in one of the first five books (Matt, Mark, Luk, Joh, Apg), which together amount to about 100,000 words in this translation. By comparison, that is only 0.2% of the amount of text in the Europarl corpus, which itself must be counted as a medium-sized parallel corpus by present-day standards.

3 Conclusion

Even a relatively short text contains a large enough vocabulary, and enough linguistic constructions, to express sentences from an entirely different time and domain, given some paraphrasing. There is thus no obstacle in principle to constructing a machine translation tool for the roughly 2,000 languages into which the New Testament has been translated. We have assumed access to a complete and correct linguistic analysis of the parallel text in the target language, and the next step would be to explore to what extent this can be achieved in practice. All that is needed is a linguist with plenty of time.

Acknowledgments

We are grateful to all the authors of the volume, who took the time to sign the best kind of congratulation cards for Lars – in the form of articles. Special thanks go to Gerlof Bouma for all his help with preparing this Festschrift for publication, and to all the people who helped us translate the phrase "Live and learn" into their languages. Thanks to Karin Wenzelberg, who approached the task of creating the cover with professional creativity. We would also like to express our appreciation to Shalom Lappin and Bernard Comrie for their help with earlier versions of the manuscript template. The work has been supported by Nationella språkbanken and HUMINFRA, both funded by the Swedish Research Council (contracts 2017-00626 and 2021-00176). This funding allowed us to perform part of the work on the volume during office hours. Finally, we thank our families, who had to tolerate our absence in the evenings and at weekends in connection with the "secret work" on the Festschrift.
We did our best to reach out to as many of Lars' colleagues and friends as possible (and as secretly from Lars as possible), but his network is as endless as his research interests, and we deeply apologize for not being able to include everyone.

GU-ISS, Research Reports from the Department of Swedish, Multilingualism, Language Technology, is an irregular report series intended as a rapid preliminary publication forum for research results which may later be published in fuller form elsewhere. The sole responsibility for the content and form of each text rests with its author.