A Robust Shallow Parser for Swedish
Ola Knutsson, Johnny Bigert, Viggo Kann
Royal Institute of Technology, Sweden
Introduction
  What is robustness?
  Robust against noisy, ill-formed and partial natural language data
Shallow parsing
  Many NLP applications do not need full parsing
  Shallow parsing:
    A parsing approach
    Pre-processing for full parsing
    A collection of techniques
  Abney: finite-state cascades (1991)
  Currently, much attention on machine learning (ML)
  Well suited to modularization
Chunking and phrase identification
  Common modules in a shallow parser:
    Tokenizer
    PoS tagger
    Chunker
    Phrase identifier
    Grammatical function identifier
Chunking
  [NP Den mycket gamla mannen][VC gillade][NP mat]
Phrase identification
  [NP Den [AP mycket gamla] mannen][VC gillade][NP mat]
  (English gloss: "The very old man liked food")
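The difference between the two bracketings can be made concrete with a small reader for the bracket notation. This is only a sketch: the function names `parse_brackets`, `words_of` and `chunks` are hypothetical and are not part of GTA. It assumes that the first token after each `[` is the phrase label and that brackets are balanced.

```python
def parse_brackets(text):
    """Parse bracket notation into a list of words and (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_items():
        nonlocal pos
        items = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                pos += 1                  # skip '['
                label = tokens[pos]       # assumed: label follows '[' directly
                pos += 1
                children = parse_items()
                pos += 1                  # skip the matching ']'
                items.append((label, children))
            else:
                items.append(tokens[pos])  # a plain word
                pos += 1
        return items

    return parse_items()


def words_of(node):
    """Collect the words under a node, ignoring inner phrase labels."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    return [w for child in children for w in words_of(child)]


def chunks(tree):
    """Flatten to top-level chunks: the 'chunking' view of a nested analysis."""
    return [(node[0], words_of(node)) for node in tree if not isinstance(node, str)]


nested = parse_brackets("[NP Den [AP mycket gamla] mannen][VC gillade][NP mat]")
flat = chunks(nested)
```

Here `nested` keeps the inner AP (the phrase identification view), while `flat` contains only the three top-level chunks (the chunking view).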
Parsers for Swedish
  Full parsers: UCP (Sågvall Hein) and SLE (Gambäck)
  Shallow parsers (phrase structure): Cass-Swe (Kokkinakis) and Megyesi (using machine learning)
  Dependency parsers: CG (Birn) and FDG (Voutilainen)
Granska Text Analyzer (GTA)
  Hand-crafted rules
  Context-free backbone
  Partly object-oriented notation
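Major Phrase Categories
  NP: Han såg den lilla mannen på bänken ("He saw the little man on the bench")
  VC: Han har spelat kort hela natten ("He has played cards all night")
  PP: Han såg spår i sanden ("He saw tracks in the sand")
  AP: Han ogillade små vita lögner ("He disliked small white lies")
  ADVP: Han vill inte gå på bio. ("He does not want to go to the cinema.")
  INFP: Han tycker om att spela ("He likes to play")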
Clause Boundary Identification
  Based on Ejerhed’s algorithm
  Context-sensitive rules
  Uses only PoS information
Different kinds of rules
  GTA contains 260 rules:
    200 identify phrase structure
    20 identify clause boundaries
    40 are selection rules (disambiguation)
Example rule, [NP den lilla bilen]
{
  X(wordcl=dt | wordcl=hd | wordcl=rg),
  X2(wordcl=ab | wordcl=rg)?,
  Y(wordcl=jj | wordcl=ro | wordcl=pc)*,
  Z(wordcl=nn)
  -->
  action(help, wordcl:=Z.wordcl, pnf:=undef, gender:=Z.gender, num:=Z.num, spec:=Z.spec, case:=Z.case)
}
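To show how the example rule above reads, here is a rough Python analogue of its pattern X, X2?, Y*, Z and of its action, which copies features from the head noun Z. It is a sketch, not GTA's rule engine: the function `match_np`, the dictionary layout of tokens, and the feature values given for "bilen" are assumptions made for illustration.

```python
# Word-class sets copied from the rule on the slide.
X_SET = {"dt", "hd", "rg"}    # X: obligatory first element
X2_SET = {"ab", "rg"}         # X2: optional element
Y_SET = {"jj", "ro", "pc"}    # Y: zero or more modifiers
Z_SET = {"nn"}                # Z: obligatory head noun


def match_np(tokens, start=0):
    """Try to match X X2? Y* Z beginning at index `start`.

    Returns (end_index, features) on success and None otherwise.
    Greedy matching suffices here because the X2 set shares no word class
    with the Y and Z sets, so no backtracking is ever needed.
    """
    i = start
    if i >= len(tokens) or tokens[i]["wordcl"] not in X_SET:
        return None
    i += 1
    if i < len(tokens) and tokens[i]["wordcl"] in X2_SET:
        i += 1
    while i < len(tokens) and tokens[i]["wordcl"] in Y_SET:
        i += 1
    if i >= len(tokens) or tokens[i]["wordcl"] not in Z_SET:
        return None
    z = tokens[i]
    # Mirror of the rule's action: the phrase inherits Z's features, pnf is undefined.
    features = {
        "wordcl": z["wordcl"],
        "pnf": None,
        "gender": z["gender"],
        "num": z["num"],
        "spec": z["spec"],
        "case": z["case"],
    }
    return i + 1, features


# Illustrative input; the feature values are assumed, not taken from the slides.
tokens = [
    {"text": "den", "wordcl": "dt"},
    {"text": "lilla", "wordcl": "jj"},
    {"text": "bilen", "wordcl": "nn", "gender": "utr", "num": "sin", "spec": "def", "case": "nom"},
]
result = match_np(tokens)
```

With this input the match covers all three tokens, and the resulting phrase carries the noun's gender, number, definiteness and case.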
Clause boundary rule
{
  V(sed!=sen & text!="som" & wordcl!=sn),
  X((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) | (wordcl=nn & case=nom & V.case!=gen) | wordcl=ab),
  ---endleftcontext---,
  Y(wordcl=kn),
  ---beginrightcontext---,
  Y2(((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) | (wordcl=nn & case=nom) | wordcl=ab) & wordcl=X.wordcl),
  Z(wordcl=vb & (vbf=prs | vbf=prt | vbf=imp))
  -->
  action(help, wordcl:=Y.wordcl)
}
The Tetris Algorithm
  Figure (overlapping candidate phrases):
    NP  boken
    NP  Fänrik Ax
    NP  general Claes Olsson
    VC  gav
    PP  till general
    PP  till general Claes
    PP  till general Claes Olsson
The IOB format
  Ramshaw and Marcus (1995)
  A phrase/clause tag contains two parts:
    1. Phrase/clause type, e.g. NP, PP
    2. One of two position tags:
       I = Inside a phrase/clause
       B = Beginning of a phrase/clause
  When a word does not belong to any phrase:
    O = Outside
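As an illustration of the tag scheme, the sketch below turns flat chunks into per-word tags made of a phrase type plus B or I, with O for words outside any phrase. The function name `chunks_to_iob` and the use of `None` as the label for outside words are assumptions for this example. The tag spelling (type followed directly by B or I, e.g. NPB) follows the way tags appear in the analysis examples on these slides.

```python
def chunks_to_iob(chunks):
    """Convert [(label, [words])] into [(word, tag)] pairs.

    A label of None marks words that belong to no phrase (tag 'O').
    """
    rows = []
    for label, words in chunks:
        for index, word in enumerate(words):
            if label is None:
                tag = "O"
            else:
                # First word of a chunk gets B, the remaining words get I.
                tag = label + ("B" if index == 0 else "I")
            rows.append((word, tag))
    return rows


rows = chunks_to_iob([
    ("NP", ["Den", "mycket", "gamla", "mannen"]),
    ("VC", ["gillade"]),
    ("NP", ["mat"]),
    (None, ["."]),
])
```

Each word ends up with exactly one tag here. Nested phrases need several tags per word, which this flat sketch does not handle.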
Disagreement error
  Word         PoS tag                         Phrase tag  Clause tag
  De           dt.utr/neu.plu.def              NPB         CLB
  gamla        jj.pos.utr/neu.plu.ind/def.nom  APB|NPI     CLI
  äppelträdet  nn.neu.sin.def.nom              NPI         CLI
  kan          vb.prs.akt.mod                  VCB         CLI
  bli          vb.inf.akt.kop                  VCI         CLI
  som          kn                              O           CLI
  nya          jj.pos.utr/neu.plu.ind/def.nom  APB         CLI
  .            mad                             O           CLI
Partial input
  Word                 PoS tag             Phrase tag  Clause tag
  Arrangör             nn.utr.sin.ind.nom  NPB         CLB
  var                  vb.prt.akt.kop      VCB         CLI
  Järfälla             pm.gen              NPB|NPB     CLI
  naturskyddsförening  nn.utr.sin.ind.nom  NPB|NPI     CLI
  där                  ab                  ADVPB       CLI
  är                   vb.prs.akt.kop      VCB         CLI
  medlem               nn.utr.sin.ind.nom  NPB         CLI
  .                    mad                 O           CLI
Noisy data
  Word         PoS tag                  Phrase tag     Clause tag
  Inte         ab                       APB            CLB
  så           ab                       ADVPB|APB|API  CLI
  tjck         jj.pos.utr.sin.ind.nom   APB|API|API    CLI
  som          ha                       O              CLB
  det          pn.neu.sin.def.sub/obj   NPB            CLI
  ofta         ab.pos                   ADVPB          CLI
  står         vb.prs.akt               VCB            CLI
  i            pp                       PPB            CLI
  lärobökerna  nn.utr.plu.def.nom       NPB|PPI        CLI
  ;            mid                      O              CLI
Word order violation
  Word        PoS tag                 Phrase tag  Clause tag
  Ympkvisten  nn.utr.sin.def.nom      NPB         CLB
  inte        ab                      ADVPB       CLI
  ska         vb.prs.akt.mod          VCB         CLI
  vara        vb.inf.akt.kop          VCI         CLI
  sådär       ab                      ADVPB|APB   CLI
  lång        jj.pos.utr.sin.ind.nom  APB         CLI
  ,           mid                     O           CLI
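The phrase tags in the analyses above can be read back into bracket notation. The sketch below is one possible reading, not GTA code. It assumes that tags joined by `|` are listed innermost phrase first, which is inferred from examples such as "gamla APB|NPI" and not stated on the slides. The helper names `decode` and `to_brackets` are hypothetical.

```python
def decode(tag):
    """Split 'APB|NPI' into [('AP', 'B'), ('NP', 'I')]; 'O' gives []."""
    if tag == "O":
        return []
    return [(part[:-1], part[-1]) for part in tag.split("|")]


def to_brackets(rows):
    """Rebuild a bracketed string from (word, phrase_tag) pairs."""
    out, stack = [], []          # stack holds the currently open phrase types
    for word, tag in rows:
        levels = list(reversed(decode(tag)))   # outermost first (assumed order)
        # Count how many open phrases this word continues (same type, tag I).
        keep = 0
        while (keep < len(levels) and keep < len(stack)
               and levels[keep] == (stack[keep], "I")):
            keep += 1
        # Close every open phrase that is not continued.
        while len(stack) > keep:
            stack.pop()
            out.append("]")
        # Open the remaining levels as new phrases.
        for phrase_type, _position in levels[keep:]:
            stack.append(phrase_type)
            out.append("[" + phrase_type)
        out.append(word)
    out.extend("]" * len(stack))
    return " ".join(out).replace(" ]", "]")


# Word and phrase-tag columns of the disagreement-error example.
rows = [
    ("De", "NPB"), ("gamla", "APB|NPI"), ("äppelträdet", "NPI"),
    ("kan", "VCB"), ("bli", "VCI"), ("som", "O"), ("nya", "APB"), (".", "O"),
]
bracketed = to_brackets(rows)
# bracketed == "[NP De [AP gamla] äppelträdet] [VC kan bli] som [AP nya] ."
```

Note that the NP is still found even though the determiner and the noun disagree in number, which is the point of the example.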
Evaluation
  Manually corrected output from GTA
  Untuned GTA in the evaluation
  words from SUC
  5 genres
F-scores for individual phrase types
  Type   Accuracy  Count
  ADVP
  AP
  INFP
  NP
  O
  PP
  VC
  Total  88.7
F-score for clause boundary identification
  Tagger   F-score
  UNIGRAM  84.2
  BRILL    87.3
  TNT      88.3
  The F-score for a baseline identifier was 69.0%
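The slides do not say how the F-score is defined. The sketch below assumes the standard balanced F-score, the harmonic mean of precision and recall, with `beta` as the usual weighting parameter. The function names and the counts in the example are made up for illustration and are not evaluation data from GTA.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def prf_from_counts(true_pos, false_pos, false_neg):
    """Compute (precision, recall, F1) from boundary counts."""
    predicted = true_pos + false_pos    # boundaries the system proposed
    actual = true_pos + false_neg       # boundaries in the gold standard
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical counts: 8 correct boundaries, 2 spurious, 2 missed.
precision, recall, f1 = prf_from_counts(8, 2, 2)
```

The scores on the slide are percentages, so a value computed this way would be multiplied by 100 for comparison.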
Applications with GTA
  We are using GTA in:
    Grammar checking, statistical and rule-based
    Clustering of medical texts
    CALL systems
  What do you want to do with GTA?
More information
  Contact: Ola Knutsson