Extraction of Ontological Information from Corpora (and Lexicon)


Extraction of Ontological Information from Corpora (and Lexicon) Dimitrios Kokkinakis dimitrios.kokkinakis@svenska.gu.se Maria Toporowska Gronostaj maria.gronostaj@svenska.gu.se 1

Outline
Goals & Observations, Resources
Related Research
Extending the Coverage of Semantic Resources (S-SIMPLE: Quality but not Quantity) – Why and How?
Key Issues Investigated for the Acquisition: Compounding vs. Syntactic Parsing & Large Corpora vs. Defining Lexicons
Pilot study regarding lexico-syntactic patterns
Enhancement – What has been achieved?
Error Analysis – For parts of the studies…
Conclusions & Future Plans 2

Goals Extend & enrich the coverage of the Swedish semantic lexicon: as automatically as possible; as inexpensively as possible (using whatever support was available); re-using lexical resources (not necessarily semantic). Test ideas regarding: context similarity; similarity in NPs of enumerative type (+ evaluation) - breadth; the power of compounds - breadth; bootstrapping the SIMPLE content using lexico-syntactic patterns for hyper/hypo relations - depth (statistical means). Research conducted 2000-2001 3

Observations & Hypotheses Observation-1: Take into account the compounding characteristic of Swedish: + easier to identify (compared to English, at least in raw text); - harder to segment/analyse (compared to English); + a lot of disambiguated compounds in our lexical DB. Observation-2: Yet another view of context similarity (see Related Research): members of a semantic group are often surrounded by other members of the same group in text; in other words, words entering into the same syntagmatic relation with other words can be perceived to be semantically similar. Observation-3: Apply lexico-syntactic patterns à la Hearst for more complex relations (pilot…) – why? Because during the previous 2 steps (see later discussion) we mainly extract synonymic/co-hyponymic entries 4

Resources
Core SIMPLE lexicon: 10,000 semantic units (6,000 words); a vital part of each entry's semantic unit is the notion of semantic class, whose value is an element in a semantic class list (95 classes), hierarchically structured (LexiQuest); content: high quality, manually compiled and verified, but limited vocabulary - quantitatively insufficient for HLT
Gothenburg Lexical DataBase (GLDB): ca 70,000 lexical entries; monolingual defining lexicon - for human readers (but also available in RDB format); advantage (particularly for this study): a number of synonymic compounds
Corpora: ca 40 million tokens (syntactically analysed) 5

Related Research (1) Context similarity plays an important role in word acquisition … so, a common characteristic of most approaches is the computation of the semantic similarity between two words on the basis of the extent to which the words' average contexts of use overlap. Usual assumption: members of the same semantic group co-occur in discourse [cf. Riloff & Shepherd, 97]. The use of syntax for generating semantic knowledge, based on distributional evidence & syntagmatic relations, is found in most previous research 6

Related Research (2) Approaches in general – steps: 1. Extract word co-occurrences (most crucial part): usually gathered based on certain relations, e.g. predicate-argument, modifier-modified, adjacency, … 2. Define similarities between words on the basis of co-occurrences (+ linguistic knowledge): combine existing linguistic knowledge (seed lexicon) & co-occurrence data. 3. Cluster words on the basis of similarities, e.g. by using the contexts of the words as features and grouping together the words that tend to appear in similar contexts, to compensate for the sparseness of the co-occurrence data 7

Related Research (3a) Hearst (1992): lexico-syntactic patterns – discovered by observation – for extracting hyponymy relations from corpora, e.g. NP {,NP}* {,} and other NP: temples, treasuries and other important civic buildings. Grefenstette (1994): extract corpus-specific semantics in parsed text using the (weighted) Jaccard measure (between two objects m and n: the number of shared attributes divided by the number of attributes in the unique union of the sets of attributes of each object), e.g. comparing ‘dog‘ & ‘cat‘ via textually derived attributes and the binary Jaccard measure: dog/pet-DOBJ dog/eat-SBJ dog/brown dog/shaggy dog/leash; cat/pet-DOBJ cat/hairy cat/leash; count({attribs shared by cat and dog}) / count({unique attribs possessed by cat or dog}) = |{leash, pet-DOBJ}| / |{brown, eat, hairy, leash, pet-DOBJ, shaggy}| = 2/6 = 0,333 8
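A minimal sketch of the binary Jaccard computation on the dog/cat example above; the attribute sets are copied from the slide, and the function name jaccard is ours:

```python
# Binary Jaccard: shared attributes / attributes in the unique union.
def jaccard(attrs_a, attrs_b):
    if not attrs_a and not attrs_b:
        return 0.0
    return len(attrs_a & attrs_b) / len(attrs_a | attrs_b)

dog = {"pet-DOBJ", "eat-SBJ", "brown", "shaggy", "leash"}
cat = {"pet-DOBJ", "hairy", "leash"}

print(round(jaccard(dog, cat), 3))   # 2 shared / 6 in the union = 0.333
```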

Related Research (3b) Lin (1998): constructing a thesaurus using syntactically parsed corpora containing dependency triples, where ||word1, relation, word2|| denotes the frequency of a triple (a dependency triple = 2 words + a grammatical relation), e.g. ||cell, pobj-of, inside|| = 16. A word similarity measure is defined based on the distributional pattern of words (“the similarity between 2 objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects”): the amount of information in a triple is I(w,r,w') = log( (||w,r,w'|| × ||*,r,*||) / (||w,r,*|| × ||*,r,w'||) ), and the similarity between two words w1, w2 is sim(w1,w2) = Σ_{(r,w) ∈ T(w1) ∩ T(w2)} (I(w1,r,w) + I(w2,r,w)) / ( Σ_{(r,w) ∈ T(w1)} I(w1,r,w) + Σ_{(r,w) ∈ T(w2)} I(w2,r,w) ), where T(w) is the set of (r,w') pairs occurring with w. Roark & Charniak (1998): noun-phrase co-occurrence statistics (actually bigrams ranked by log-likelihood) for semi-automatic semantic lexicon construction; input is a parsed corpus and initial seed words (= the most frequent head nouns in a corpus [top 200-500]) – based on conjunctions (cars and trucks), lists (planes, trains and automobiles), appositives and noun compounds (pickup truck) 9
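A rough sketch of Lin's measure as reconstructed above; `counts`, a map from dependency triples to corpus frequencies, is a toy stand-in for a parsed corpus, and the function names are ours:

```python
import math
from collections import defaultdict

def triple_information(counts):
    """I(w,r,w') = log( ||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||) )."""
    w_r, r_w2, r_tot = defaultdict(int), defaultdict(int), defaultdict(int)
    for (w, r, w2), c in counts.items():
        w_r[(w, r)] += c          # ||w,r,*||
        r_w2[(r, w2)] += c        # ||*,r,w'||
        r_tot[r] += c             # ||*,r,*||
    info = {}
    for (w, r, w2), c in counts.items():
        i = math.log(c * r_tot[r] / (w_r[(w, r)] * r_w2[(r, w2)]))
        if i > 0:                 # keep only informative triples
            info[(w, r, w2)] = i
    return info

def lin_similarity(w1, w2, info):
    """Shared information divided by the total information describing w1 and w2."""
    t1 = {(r, w) for (x, r, w) in info if x == w1}
    t2 = {(r, w) for (x, r, w) in info if x == w2}
    shared = sum(info[(w1, r, w)] + info[(w2, r, w)] for (r, w) in t1 & t2)
    total = (sum(info[(w1, r, w)] for (r, w) in t1) +
             sum(info[(w2, r, w)] for (r, w) in t2))
    return shared / total if total else 0.0
```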

Related Research (3c) Tokunaga et al. (1997): new words (nouns) are classified on the basis of relative probabilities of a word belonging to a given word class, with the probabilities calculated using noun-verb co-occurrence pairs (Japanese + BGH thesaurus) – the algorithm was originally developed for document categorization – each noun is represented by a set of co-occurring verbs. Lin & Pantel (2002): each word is represented by a feature vector; each feature corresponds to a context in which the word occurs (threaten with _ is a context, and if handgun occurred in that context the context is a feature of handgun); the value of a feature is the mutual information (MI) between the feature and the word; similarity between 2 words is calculated using the cosine coefficient of their MI vectors – clustering is then based on these results 10
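A small illustrative sketch of the Lin & Pantel-style representation: each word carries a vector of MI scores over its contexts, and similarity is the cosine of those vectors; the vectors below are invented toy values, not corpus-derived figures:

```python
import math

def cosine(v1, v2):
    """Cosine coefficient of two sparse feature vectors (dicts feature -> MI)."""
    dot = sum(v1[f] * v2[f] for f in set(v1) & set(v2))
    norm = (math.sqrt(sum(x * x for x in v1.values())) *
            math.sqrt(sum(x * x for x in v2.values())))
    return dot / norm if norm else 0.0

mi_vectors = {                       # toy MI values for illustration only
    "handgun": {"threaten with _": 3.2, "fire _": 2.7},
    "pistol":  {"threaten with _": 2.9, "fire _": 3.0, "holster for _": 1.4},
}
print(cosine(mi_vectors["handgun"], mi_vectors["pistol"]))
```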

So… enhancing SIMPLE by… …Analyzing compounds: a large number of compounds can inherit relevant parts of semantic info provided that the heads of the lexemes occur in SIMPLE; testing for lexicalisation in GLDB in order to avoid incorporation of idiomatic or metonymic meanings; applying compound segmentation. …Semantic similarity in NPs of enumerative type: use of partial parsing on large corpora; words entering into the same syntagmatic relation with other words are perceived as semantically similar; however, certain conditions must be satisfied in order to avoid incorporation of erroneous entries. …Lexico-syntactic patterns for acquiring concepts higher in the hierarchy (see examples) 11

Extending SIMPLE … illustration Compounding example: färja?, kryssningsfartyg?, tankers? och ro-ro-fartyg? >> No matches (ferries, cruise-ships, tankers and ro-ro-vessels) färja? kryssnings||fartygVEH tankers? ro-ro-||fartygVEH >> färjaVEH kryssningsfartygVEH tankersVEH ro-ro-fartygVEH Enumerative NP example: juristerOCC-AG, läkareOCC-AG, optikerOCC-AG, psykologer? och sjukgymnaster? >> 3 matches (lawyers, doctors, opticians, psychologists and physiotherapists) >> condition: if at least 2 have the same tag & the rest none ==> add to lexicon! >> psykologOCC-AG sjukgymnastOCC-AG Lexico-syntactic pattern example: älgar, sorkar, fåglar, kor, hästar och andra djur (elks, voles, birds, cows, horses and other animals) 12

Compounding Take advantage of the fact that Swedish is a compounding language (e.g. >70% of SAOL are compounds): compounds are single orthographic units; many compound words are not lexically represented; they generally have predictable meanings - relatively transparent; most compounds are essentially binary & in most cases both elements are represented in GLDB. Given a sizeable number of analysed compounds it is possible to automatically establish a ”semantic compounding profile” for all lexemes. In predictable compounds the meaning is a function of the meaning of the components, related to each other by an implied predicative functor, e.g. brödkniv = brödX + knivY ‘bread knife’ implies ‘Y for (cutting) X’. We used compounds from GLDB's synonym slot … and corpora … but they have to be segmented & analysed (see Järborg, Kokkinakis & Toporowska-Gronostaj, ’02) 13

Semantic Compound Definitions
Y that is located in/at …: klassrumsdörr (classroom + door)
Y that is made up of X: kanalsystem (canal + system)
Y that originates from X: smutsfläck (dirt + stain)
Y that is aimed at X: kaninjakt (rabbit + hunt)
Y that is about X: partikelfysik (particle + physics)
Y that produces X: batterifabrik (battery + factory)
Y that prevails in X: partiideologi (party + ideology)
Y that contains X: kaffetermos (coffee + thermos)
Y that consists of X: kaffepulver (coffee + powder)
Y that has to do with X: klädbesvär (clothes + trouble)
....... 14

An Example Profile for ´område´ marknad.1.2.0 avrinning.1.1.0 mark.1.2.0 affär.1.2.b bangård.1.1.0 kommunikation.1.2.0 barrskog.1.1.0 avtal.1.1.0 kompetens.1.1.0 kust.1.1.0 katastrof.1.1.0 område.1.1.0 <geogr.> område.1.1.b <abstr.> land.1.1.b Medelhavs.PM kunskap.1.1.a kultur.1.2.0 Luleå.PM kärna.1.1.c marknadsföra.1.1.0 läkemedel.1.1.0 kostnad.1.1.0 myr.1.1.0 motiv.1.2.0 15

Compounds fr. GLDB - already disambiguated... GLDB & S-SIMPLE entries are linked to the sub-senses in GLDB. E.g., S-SIMPLE encodes the non-compound lemma ämne (as having 4 senses, marked 1/1-1/4), which are disambiguated here by means of their assignment to the following semantic types and semantic classes: Material: Matter ‘material’; Substance: Substance ‘stoff’; Part: Abstract ‘topic’; Domain: Notion ‘subject, discipline’. Each of the senses is exemplified in GLDB with a number of compounds, comprising 26 in total with ämne as the head.
SIMPLE (5): ämne:1/1:Matter, grundämne:1/1:Matter, ämne:1/2:Substance, ämne:1/3:Abstract, ämne:1/4:Notion
GLDB (26): färgämne:1/1, hornämne:1/1, …, yxämne:1/2, fruktämne:1/2, predikoämne:1/3, uppsatsämne:1/3, läroämne:1/4, skolämne:1/4 16

Compounds fr. Corpora Heuristic compound decomposition/segmentation and matching of the SIMPLE content with the heads of the segmented compounds. Try to distinguish the modifier’s characteristics (POS & semantic category, if any): is the modifier an adjective or a proper noun? OK, e.g. klocka: digital||klocka, stor||klocka; anhängare: Hitler||anhängare, Likud||anhängare. S-SIMPLE as a means of bootstrapping the process, e.g. glas ‘glass’, extended with compounds having SUBSTANCE as a modifier: [vatten,vin,öl,likör]glas ‘water, wine, beer and liqueur glass’. Check against lists of lexicalized compounds to eliminate incorrect data => GLDB allows the exclusion of such compounds from the derived sets, e.g. feber: 40 compounds from corpora, e.g. scharlakansfeber, but not all are ILLNESS: ‘resfeber’, ‘diamantfeber’ 17
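A minimal sketch of the head-matching idea on this slide: a segmented compound inherits the semantic class of its head from S-SIMPLE unless GLDB lists it as lexicalised; the two dictionaries below are toy placeholders, not the project's actual resources:

```python
# head noun -> S-SIMPLE semantic class (toy excerpt)
SIMPLE_CLASS = {"fartyg": "VEHICLE", "feber": "ILLNESS"}
# compounds lexicalised in GLDB with idiomatic/metonymic readings (toy excerpt)
LEXICALISED = {"resfeber", "diamantfeber"}

def classify_compound(modifier, head):
    """Inherit the head's class, unless the whole compound is lexicalised."""
    if modifier + head in LEXICALISED:
        return None                       # exclude from the derived set
    return SIMPLE_CLASS.get(head)

print(classify_compound("kryssnings", "fartyg"))   # VEHICLE
print(classify_compound("res", "feber"))           # None: lexicalised, not ILLNESS
```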

Heuristic Compound Segmentation Previous attempts to segment Swedish compounds without the help of a “real” lexicon are described in Brodda (1979): based on the distributional properties of graphemes, trying to identify grapheme combinations indicating possible boundaries (promising for Germanic languages); mostly automatic with some manual work.
sd, sg, tk, tp: is||dans (ice-dance), bidrags||givare (contributor), bröst||kirurgi (breast surgery), vit||peppar (white pepper)
dsb, psr, psd, ftv, rnk: lands||bygd (countryside), bröllops||resa (honeymoon trip), kropps||delen (body part), luft||värme (air warmth), kärn||kraft (nuclear power)
ngss, tsfa, gssp, spla, spap: honungs||sött (honey sweet), besluts||fattare (decision-maker), vardags||språket (colloquial language), femårs||plan (five year plan), bakplåts||papper (baking-plate paper) 18
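A toy sketch of the Brodda-style heuristic: a short list of grapheme clusters that rarely occur inside simplex Swedish words marks candidate compound boundaries. The cluster list and the split offsets inside each cluster are illustrative assumptions, not Brodda's actual tables:

```python
import re

# cluster -> where the boundary falls inside the cluster (assumed offsets)
BOUNDARY_CLUSTERS = {"sd": 1, "sg": 1, "dsb": 2, "ngss": 3, "gssp": 2}

def segment(word):
    """Return 'modifier||head' at the first heuristic boundary found."""
    for cluster, offset in BOUNDARY_CLUSTERS.items():
        m = re.search(cluster, word)
        if m:
            cut = m.start() + offset
            return word[:cut] + "||" + word[cut:]
    return word                      # no boundary cluster found

print(segment("isdans"))             # is||dans
print(segment("landsbygd"))          # lands||bygd
print(segment("honungssött"))        # honungs||sött
```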

Compound Processing cont'd Estimation: >20-25 compounds per S-SIMPLE entry (for NOUNS), based on 1,000 nouns in SIMPLE; this increased the vocabulary to >22,000. The top-5 non-compound entries from corpora richest in compound variants (some very ambiguous!): program ‘programme, program’ (469 diff. comp.), arbete ‘work, employment’ (402 diff. comp.), chef ‘chief’ (390 diff. comp.), bok ‘book’ (357 diff. comp.), verksamhet ‘activity, operation’ (299 diff. comp.) 19

Modifier’s Characteristics (compound, boundary grapheme clusters, SIMPLE tag)
bad||toffla (dt) #garment
barn||vårds||lärare (rnv, dsl) #occupation_agent
bas||bolag (sb) #agency
bläck||fisk (kf) #fish
bolags||plundrare (gspl) #occupation_agent
brud||bergs||skola (db, gss) #abstract#agency#functional_space
bygg||bolag (gb) #agency
bygg||företag (gf) #agency
centralbanks||chef (ksch) #occupation_agent
doping||brott (ngb) #change 20

Syntactic Parsing (1) Compounds are a valuable resource, but how can we cope with the rest of the vocabulary? A corpus-driven approach to acquire semantic lexicons (cf. Kokkinakis, 2001): investigate how, and to what extent, the flexibility and robustness of a partial parser (a cascaded finite-state syntactic parser) can be utilized to fully automatically extend existing semantic lexicons. Observation: members of a semantic group are often surrounded by other members of the same group in text; in other words, words entering into the same syntagmatic relation with other words are perceived as semantically similar 21

Syntactic Parsing (2) Corpus: 40 million tokens (Swedish Language Bank) tagged with Brill's tagger. Parsing using CASS-SWE, in which levels or bundles of rules with very specific characteristics & content can be rapidly created & tested, e.g. specific types of NPs (takes POS-tagged texts as input). Example - simplified: Rule => ‘DETERMINER? COM-NOUN (COM-NOUN F)* COM-NOUN CONJ COM-NOUN’ (färger, penslar, papper och matsäckar); Rule => ‘APPOSITION-NOUN? PROP-NOUN+ (F PROP-NOUN)+ CONJ PROP-NOUN+’ (Venezuela, Trinidad och Island). The number of unique retrieved phrases was ca 36,000 (phrases without proper names) and ca 72,000 (phrases with proper names) 22
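A rough Python approximation of such an enumerative-NP rule, run over a POS-tagged sentence represented as (token, tag) pairs; the tag names and the regex are simplified stand-ins for the CASS-SWE grammar:

```python
import re

tagged = [("färger", "NN"), (",", "F"), ("penslar", "NN"), (",", "F"),
          ("papper", "NN"), ("och", "CONJ"), ("matsäckar", "NN")]

# Encode the tag sequence as a string and match: NN (F NN)* CONJ NN
tag_string = " ".join(tag for _, tag in tagged)
if re.fullmatch(r"NN(?: F NN)* CONJ NN", tag_string):
    nouns = [tok for tok, tag in tagged if tag == "NN"]
    print(nouns)    # ['färger', 'penslar', 'papper', 'matsäckar']
```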

Syntactic Parsing (3)
1. Gather, POS-annotate & parse large corpora
2. Filter out long NPs & filter out knowledge-poor elements
3. 1st pass: measure the overlap between the members of the extracted phrases and the entries in the semantic lexicon
3a. If the conditions apply, add new categorised entries to the database
3b. Repeat the previous 2 steps until very little or nothing is matched
4. 2nd pass: compound-segment the members of the phrases left
4a. Check whether they are lexicalised; do not use them if they are
4b. Repeat the process from step (3), this time matching the heads against the content of the database 23
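A condensed sketch of steps 3-4 above; the lexicon, the phrase and the `segment` helper are toy placeholders, and the real system iterates these passes over tens of thousands of parsed phrases:

```python
lexicon = {"jurist": "OCC-AG", "läkare": "OCC-AG", "optiker": "OCC-AG",
           "fartyg": "VEHICLE"}
LEXICALISED = set()                       # lexicalised compounds from GLDB (toy)

def first_pass(phrase, lexicon):
    """If >=2 members share one class and the rest are untagged, tag the rest."""
    tags = [lexicon.get(w) for w in phrase]
    known = [t for t in tags if t is not None]
    if len(known) >= 2 and len(set(known)) == 1:
        for w, t in zip(phrase, tags):
            if t is None:
                lexicon[w] = known[0]     # new co-hyponym entry

def second_pass(phrase, lexicon, segment):
    """Let non-lexicalised compounds inherit their head's class, then re-check."""
    for w in phrase:
        if w not in lexicon and w not in LEXICALISED:
            head = segment(w)[-1]         # segment() -> (modifier, head), assumed
            if head in lexicon:
                lexicon[w] = lexicon[head]
    first_pass(phrase, lexicon)

first_pass(["jurist", "läkare", "optiker", "psykolog", "sjukgymnast"], lexicon)
print(lexicon["psykolog"], lexicon["sjukgymnast"])    # OCC-AG OCC-AG
```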

Syntactic Parsing (4) Large quantities of partially parsed corpora are an important ingredient for the enrichment and further development of the semantic resources – cf. all previous attempts: use syntax for generating semantic knowledge. From the forest of chunks produced, filter out long NPs (≥3 common nouns), lemmatise, normalise, filter out knowledge-poor elements (determiners, punctuation) & measure the overlap between the nouns in the NPs and the entries in S-SIMPLE. If at least 2 of the nouns in an NP are entries in SIMPLE, with the same semantic class, then there is a strong indication that the rest of the nouns are co-hyponyms, thus semantically similar to the two already encoded in S-SIMPLE – iterate. Apply compound segmentation to the members of the phrases left; check for lexicalisation in a defining dictionary (GLDB) and do not use them if they are lexicalised; repeat the previous step & iterate, BUT match the heads! 24

First Pass Overlap Match a database holding the content of the resources against the content of the phrases. Assume: if at least 2 of the members of a phrase are also entries in the lexicon, with the same semantic class, and the rest of the phrase members have not received a semantic annotation, then there is a strong indication that the rest of the members are co-hyponyms, and thus semantically similar to the two already encoded in the lexicon. Accordingly, we annotate them with the same semantic class, e.g. (lawyers, doctors, opticians, psychologists and physiotherapists): juristerOCC-AG, läkareOCC-AG, optikerOCC-AG, psykologer? och sjukgymnaster? ===> 3 matches ==> condition: if at least 2 have the same tag & the rest none ==> add to lexicon! psykologOCC-AG sjukgymnastOCC-AG 25

Second Pass Overlap A large number of phrases were not used; none or only one of the members of these phrases was covered by SIMPLE, either the original or the enriched version. Take into account the compounding characteristic of Swedish (>70%, or 80,000 entries, in SAOL are compounds): heuristic decomposition of compounds & matching of the SIMPLE content with the heads of the segmented compounds. Assume: a considerable number of casual or on-the-fly created compounds can inherit relevant parts of the semantic info provided for their heads by SIMPLE, e.g.: färjor?, kryssningsfartyg?, tankers? och ro-ro-fartyg? ===> No matches (ferries, cruise-ships, tankers and ro-ro-vessels) färja? kryssnings||fartygVEH tankers? ro-ro-||fartygVEH ===> färjaVEH kryssningsfartygVEH tankersVEH ro-ro-fartygVEH 26

Syntactic Parsing (5) Result: errors/noise can be eliminated if the semantic tags of all the words in a phrase are compared: kvinnor:BIO, barn:BIO, husdjur:??? och möbler:FURNITURE. Ambiguities are propagated: flaskor:CONTAINER-AMOUNT, tallrikar:CONTAINER-AMOUNT, vinglas:??? Result: approx. 3,300 new noun entries to the Swe-S could be identified without any further processing (i.e. bootstrapping the compound analysis) – and only during the ‘first pass’ 27

Loooong NPs (1) har jag ätit ko, gris, lamm, häst, hare, kanin, ren, älg, känguru, orre, tjäder, duva, kyckling, anka, gås, struts, krokodil, haj, lax, torsk, abborre, gädda, bläckfisk och en massa firrar till … ekonom sociolog litteraturvetare stadsplanerare mediaexpert filosof reklamfolk företrädare formgivare ingenjör författare diktare filmare popmusiker leksaksfabrikant klädskapare arkitekt journalist vetenskapsman... (press98) inflationsutveckling framtidstro orderingång arbetsmarknadspolitik företagsbeskattning ränteläge handelshinder investeringstakt råvarupris produktionsutveckling… slangnipplar slangpumpar flödesmätare gummihandskar röntgenapparater proteser testcyklar diskmaskiner journalsystem bensågar kuvöser blodmixrar urintestremsor centrifuger... (press95) bokstav måttband klocka miniräknare plastbestick barnbild nyckel batterier filmrulle (SUC) 28

Loooong NPs (2) Belgien Danmark Frankrike Grekland Island Italien Kanada Luxemburg Nederländerna Norge Portugal Spanien Storbritannien Turkiet Tyskland USA… (p97) all världens ortnamn : Lahti , Kalundborg , Oslo , Motala , Luleå , Moskva , Tromsö , Vasa , Åbo , Rom , Hilversum , Vigra , Bryssel , London , Prag , Athlone , Köpenhamn , Stuttgart , München , Riga , Stavanger , Paris , Warszawa , Bodö och Wien… (romii) Birte Heribertson Bodil Mårtensson Anette Norberg Bror Tommy Borgström Karin Bergqvist Mats Ågren Mattias Renehed Tobias Ekstrand… (p96) Robert Hedman , Kjell Jönsson , Ingemar Eriksson , Jonas Runesson , Miguel Exposito , Micke Berg , Lars Oscarsson , Fredrik Aliris , Jimmy Anjevall , Putte Johansson , Petter Jokobsson , Daniel Edfalk , Mattias Larsson , Daniel , Westerlund , Daniel Johansson , Peter ... 29

Evaluation (1) Quantity Evaluation of the Syntactic Parsing approach (see Kokkinakis, 01). Results after six iterations:
SIMPLE | Original 2,921 | Pass-1 5,110 | Pass-2 1,100 | Total 9,131
NAMES | Original 10,550 | Pass-1 25,700 | Pass-2 --- | Total 36,250 30

Evaluation (2) Quality Evaluation: manually, for a number of groups, based on common sense and judgement
Class | Original | New | Wrong or spurious | Precision
OrganisationNE | 1300 | 395 | 22 | 94,4%
Phenomenon | 36 | 29 | 9 | 69%
Bio | 46 | 107 | 12 | 88,8%
Ideo | 17 | 74 | - | 97,8%
Vehicle | 33 | 118 | 17 | 85,6%
Apparatus | 22 | 27 | 2 | 92,6%
Garment | 25 | 184 | 19 | 89,7%
Illness | 38 | 66 | 8 | 87,9%
Flower | - | 26 | 3 | 88,5% 31

Examples of Acquired Entries (1) BIO: any classification of human beings (groups or individuals) according to a biological characteristic like age, sex, etc; i.e. adult, twin, brother, bastard, husband, miss… ORIGINAL (46): bror, fru, hustru, son, tjej, gudbarn, ... NEW (107): barn, barnbarnsbarnbarn, children!!, dotter, dotterdotter, fader, far, farbror, farfader, farfarsfar, farförälder, farmoder, faster, flickvän, fosterförälder, fästmö, huskarl, hustru, jungfru, kusin, … SPURIOUS/WRONG (12): orientarmé, regnskog, sjukhuspersonal, skilsmässa, sopa, studieförbund, svågra, totalisatorspel, trapetsartist, tutsier, älder, äppelträd PRECISION: 88,8% 32

Examples of Acquired Entries (2) APPARATUS: tools or devices used together to provide a particular functionality for a particular task; i.e. dishwasher, camera, computer, recorder… ORIGINAL (22): video, kamera, frys, kopiator, mixer, ... NEW (27): bandspelare, cd-rom-läsare, cd-spelare, dator, dvd-spelare, faxapparat, filmkamera, frysbox, handdator, nätverksdator, radio, skrivare, symaskin, televisionsapparat, teve-apparat, tv-apparat, videoapparat, ...  SPURIOUS/WRONG (2): fonduegryta??, skafferi PRECISION: 92,6% 33

Examples of Acquired Entries (3) VEHICLE: artifacts (or their parts) made for the transport of goods, livestock or people; i.e. truck, sedan, bicycle, license plate!!!,submarine… ORIGINAL (33): kajak, bil, jeep, båt, flotte,… NEW (118): ambulans, brandbil, buss, charter, direktbuss, distributionsbil, elbil, flakmoped, flakmoppa, flodbåt, flyg, flygplan, fordon, fregatt, färja, helikopter, husvagn, hästfordon, hästkärra, korvett, krigsfartyg, lastvagn, … SPURIOUS/WRONG (17): anläggningsmaskin, arbetsmaskin, artilleri, artilleripjäs, entreprenadmaskin, förband, förvaltningsmyndighet, gräsklippare, skida PRECISION: 85,6% 34

Evaluation (3) Quality Evaluation nr 2: comparison with 2 synonym dictionaries, STRÖMBERGS & BONNIERS (missing in STR+BON: ösregn, spöregn, hällregn!)
Word | SIMPLE Label | STR+BON (x+x=unique) | Missing in SIMPLE
bil - car | VEHICLE | 7+8=11 | 3 – vagn, kärra, åk
regn - rain | PHENOM. | 17+14=21 | 15 – väta, ström, flod, dusch, kaskad, våtväder etc.
rederi – shipping company | AGENCY | 3+4=6 | 5 – skeppsägare, linje, båtbolag, fartygsbolag, sjöfartsbolag 35

Error Analysis Sources of errors: part-of-speech and lemmatisation errors; a number of long, enumerative NPs with many entries unknown to the lexicon, where 2 or 3 happened to correctly get the same semantic label but some got the wrong one: tröjaGARMENT halsduk strumpaGARMENT underkläder skiva album => GARMENT ... assigned to the rest... … and of course polysemy: depressionEMOTION ångestEMOTION spänning? => EMOTION ...but tryckATTRIBUTE spänningEMOTION? vibration tyngdkraftATTRIBUTE 36

Lexico-syntactic Patterns Compounding and enumerative NPs are a good starting point for acquiring synonyms & co-hyponyms. Pattern-based lexico-syntactic recognition is suitable for acquiring hyperonyms-hyponyms (and partly meronyms). Language-specific patterns; discovery by observation; a good parser is necessary - good coverage of NPs; requires more research on the effects of the various modifiers that can alter the semantic relation 37

Lexico-syntactic Patterns (1) hyperonym-hyponym NP av (typ/en|märke/t|model/len|…) ("|'|:)? (NP|(NP,)+) (och NP|eller NP)? … en bil av märket Ford Granada … … okänd soldat som bar gymnastikskor av märket Nike … … sys bland annat kalsonger och undertröjor av märket Börje Salming … … tusen personbilar av modellen S70/V70 i Masas fabrik . … planen är av typen F117A ( stealth ) … … fartygen har jaktplan av typen F14 som anpassats att bära laserstyrda … 38

Lexico-syntactic Patterns (2) Hyperonym?-hyponym? NP ,? (såsom|liksom|som)(NP|(NP,)+|:NP|:(NP,)+) (eller|och) (andra|annat|annan) NP NP ,? (eller|och) (andra|annat|annan) NP NP ,? (såsom|liksom|som) (andra|annat|annan) NP … explorer plockar poäng på automatlåda , farthållare , luftkonditionering , radio och annan utrustning … fastighetsägaren ville ha en total renovering med ny spis , kyl , frys , spiskåpa och annan köksinredning NP : NP (NP ,)+ (m fl|med flera|mm|osv)? … årets dansband : Arvingarna , Barbados , Joyride , Sound Express . … riksdagsmännens alla bidrag : barnbidrag , bostadsbidrag , socialbidrag , studiebidrag osv . … kroniskt sjuka : epileptiker , hjärtsjuka , njursjuka m fl … bästa webbplatserna : Spray , Gula Sidorna , Dagens_Nyheter , Passagen , Arbetsförmedlingen , Resfeber , Pricerunner , Bidlet , SEB och Bluemarx . hyperonym-hyponym 39
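A simplified regex for the Swedish "X, Y och andra/annan/annat Z" variant of these patterns, extracting Z as the hypernym candidate and X, Y as hyponyms; a real implementation would match over parsed NPs rather than raw tokens:

```python
import re

PATTERN = re.compile(
    r"(?P<hypos>\w+(?:\s*,\s*\w+)*)\s+och\s+(?:andra|annan|annat)\s+(?P<hyper>\w+)"
)

text = "älgar, sorkar, fåglar, kor, hästar och andra djur"
m = PATTERN.search(text)
if m:
    hyponyms = [h.strip() for h in m.group("hypos").split(",")]
    print(m.group("hyper"), "<=", hyponyms)
    # djur <= ['älgar', 'sorkar', 'fåglar', 'kor', 'hästar']
```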

Lexico-syntactic Patterns (3) hyperonym-hyponym NP ,?|(? inklusive (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )? NP ,? (? särskilt (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )? NP ,? (? speciellt (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )? NP ,? (? mestadels (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )? NP ,? (? däribland (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )? … en rad företag , däribland Ica , Dagab och Ikea … Natoländer , inklusive Frankrike , Tyskland , Spanien och Grekland NP som (till exempel|t ex|t.ex.) NP (, NP)* … stora båtar som till exempel segelfartyg … storhelger som t ex nyårsdagen , juldagen har vi … … finns det specialavdelningar att se på mässan? som t ex Classic boat show , surfexpo , sjösäkerhet och dykexpo . hyperonym-hyponym 40

Lexico-syntactic Patterns (4) hyperonym-hyponym (sån/a/t|sådan/a/t)? NP ,? (som|såsom) (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? … välkända biorullar såsom Carrie , Eldfödd , Stalker , Den onda cirkeln , Shining och Matilda … flera färger såsom lichtgult , svart , vitt , rött , blått , grönt , … en rad underspecialiteter , såsom kardiologi , gastro-enterologi , endokrinologi , hematologi , njurmedicin och reumatologi . NP : NP (, NP)+ (och NP|eller NP)? … leverantörerna av affärssystem : SAP , Intentia , IFS och IBS … folksjukdomarna : alkoholism , ätstörningar , medicinmissbruk och panikångest … krafter av olika slag : tyngdkraft , muskelkraft , friktionskraft , magnetisk kraft hyperonym-hyponym 41

Lexico-syntactic Patterns (5) hyperonym-hyponym NP (, NP)+ är några av NP …" Nilens dotter " , " Sorgens stad " och " Marionettmästaren " är några av de filmer … … La-Seyne-sur-Mer , Orléans , Brest och Dijon är några av de städer… … språk , internationell rätt , utrikes- och säkerhetspolitik , press- och informationsfrågor , administration samt muntlig och skriftlig framställning är några av de ämnen som studeras … … El Salvador , Kazakstan och Jamaica är några av de länder som nu … NP? som? består?SENSE? av NP (, NP)+ (och NP)? … instrumentalensemblen? som består av flöjt , klarinett , trombon, gitarr , violin ,… …” De ensamma öarna?” som består av Koufonissi , Iraklia , Donousa och Schinousa … av företagsamhet som består av produktutveckling , produktion , distribution och försäljning holonym-meronym 42

Conclusion & Outlook Simple, surprisingly efficient methods to acquire/enhance general-purpose semantic knowledge from large corpora, profiting from the productive compounding characteristic of Swedish; use of partially parsed corpora for extending semantic lexicons, and a unified way to process compounds. Both parsing & compounding are of equal importance: through parsing we allow the incorporation of new, mainly non-compound words; through compounding we allow new compounds of existing entries (Kokkinakis et al. ’00). Still needed: better means of evaluation and a decrease in the amount of spuriously generated entries (many due to POS errors) 43

Conclusion & Outlook cont'd We believe that S-SIMPLE can be extended to a large semantic resource appropriate for a large number of (intermediate) NLP tasks; its compatibility with the manually developed S-SIMPLE lexicon can be guaranteed and its high quality maintained. Near future (Nov ’03): we expect an evaluation from VR on whether our application will get funded or not; it passed the 1st step, but that doesn't guarantee success ==> goal: larger corpus; more comprehensive study; combine compounding, parsing, patterns and statistics 44

References
Brodda B. (1979). Något om de svenska ordens fonotax och morfotax: Iakttagelse med utgångspunkt från experiment med automatisk morfologisk analys. In: ”I huvet på Benny Brodda”. Festskrift till densammes 65-årsdag.
Grefenstette G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.
Hearst M. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the 14th International Conference on Computational Linguistics. Nantes, France.
Järborg J., Kokkinakis D. & Toporowska-Gronostaj M. (2002). Lexical and Textual Resources for Sense Recognition and Description. Proceedings of the 3rd LREC, Las Palmas.
Kokkinakis D., Toporowska Gronostaj M. & Warmenius K. (2000). Annotating, Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon. Proceedings of the 2nd Language Resources and Evaluation Conference (LREC), vol. III:1397-1404. Athens, Hellas.
Kokkinakis D. (2001). Syntactic Parsing as a Step for Automatically Augmenting Semantic Lexicons. Proceedings of the 39th Association of Computational Linguistics (ACL) and 10th European Chapter of the Association of Computational Linguistics (EACL), 13-18. Miltsakaki E., Monz C. and Ribeiro A. (eds). (Companion Volume). CNRS, Toulouse, France.
Lin D. (1998). Automatic Retrieval and Clustering of Similar Words. COLING-ACL98, Montreal, Canada.
Lin D. & Pantel P. (2002). Concept Discovery from Text. Proceedings of the International Conference on Computational Linguistics, pp. 577-583. Taipei, Taiwan.
Riloff E. & Shepherd J. (1997). A Corpus-Based Approach for Building Semantic Lexicons. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 117-124.
Roark B. & Charniak E. (1998). Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1110-1116.
Tokunaga T. et al. (1997). Extending a Thesaurus by Classifying Words. Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. 45