Presentation transcript:

Vector-based Semantic Analysis
Leif Grönqvist
Växjö University (Mathematics and Systems Engineering)
GSLT (Graduate School of Language Technology)
Göteborg University (Department of Linguistics)
Stockholm, December 2003

Outline of the talk
- My background and research interests
- Vector space models in IR
  - The traditional model
  - Latent semantic indexing (LSI)
- Singular value decomposition (SVD)
- Weaknesses in the model and how to avoid or get rid of them
- Experiments – extended LSI

My background
- ”4-årig teknisk” (four-year technical upper secondary school, electrical engineering)
- M.Sc. (official translation of “Filosofie Magister”) in Computing Science, Göteborg University
- 62 points in mechanics, electronics, etc.
- Work at the Linguistics Department in Göteborg
  - Various projects related to corpus linguistics
  - Some teaching on statistical methods (Göteborg and Uppsala), corpus linguistics in Göteborg, Sofia, Beijing, South Africa
- 1995: Consultant at Redwood Research in Sollentuna, working on information retrieval in medical databases
- Work at the Department of Informatics in Göteborg (the Internet Project)
- PhD student in Computer Science / Language Technology in Växjö

Research interests
- Statistical methods in computational linguistics
  - Corpus linguistics
  - Hidden Markov models
  - Tagging, parsing, etc.
  - Machine learning
- Information retrieval
  - Vector space models containing semantic information
    - Co-occurrence statistics
    - LSI
    - Adding more linguistic information
  - Clustering
- Finite state technology

My thesis – purpose
- Will be finished in 2006
- ”The major goal of this investigation is to improve the vector model obtained by LSI, using linguistic information that can be extracted automatically from raw textual data. The kinds of improvements in mind for specific applications are:
  - Give a search engine the capability to disambiguate ambiguous words, and to distinguish between different persons with the same name.
  - Make it possible for a keyword extractor to find not just words, but a list of relevant multi-word units, phrases and words.”

My thesis – purpose, cont.
”The starting point is that multi-word units in the vector model could give the improvements above, since the model as it is includes only words, not phrases. How this information should be added is an open question. Two possible ways would be to:
- Insert tuples/collocations extracted by some kind of statistics, for example based on entropy
- Use a shallow dependency parser
How this should be done is not at all clear, so many experiments will be needed. However, it is important to use extremely fast algorithms: it should be possible to prepare at least one billion words in reasonable time (on the order of one day), which limits the possible ways to add these phrases.”

My thesis – purpose, cont.
”The different approaches may be evaluated using a trained vector model and:
- A typical IR test suite of queries, documents, and relevance information
- Texts with lists of manually selected keywords (multi-word units included)
- The Test of English as a Foreign Language (TOEFL), which tests the ability to select synonyms from a set of alternatives
An improved model could benefit applications like traditional information retrieval, keyword extraction, and automatic thesaurus construction.”

The traditional vector model
- One dimension for each index term
- A document is a vector in a very high-dimensional space
- The similarity between a document d and a query q is the cosine of the angle between their vectors: sim(d, q) = (d · q) / (|d| |q|)
- Gives us a degree of similarity instead of yes/no as for basic keyword search
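To make the traditional model concrete, here is a minimal sketch in Python (not from the talk; the index terms and toy texts are made up for illustration): documents and queries are represented as term-count vectors over a fixed set of index terms and compared with the cosine measure.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def to_vector(tokens, index_terms):
    """Represent a token list as a term-count vector over the index terms."""
    counts = Counter(tokens)
    return [counts[t] for t in index_terms]

# Hypothetical index terms and documents, just for illustration
index_terms = ["interface", "computer", "user", "system", "trees", "graph"]
doc = "the user interface of the system".split()
query = "user interface".split()

print(cosine(to_vector(doc, index_terms), to_vector(query, index_terms)))
```

Unlike plain keyword matching the score is graded, but a term that never occurs in a document contributes nothing to the score, which is the weakness the following slides address.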

The traditional vector model, cont.
- Assumption used: all terms are unrelated
- Could be fixed partially by using different weights for each term
- Still, we have a lot more dimensions than we want
  - How should we decide the index terms?
  - Similarity between terms is always 0
  - Very similar documents may have sim = 0 if they:
    - use a different vocabulary
    - don't use the index terms

Latent semantic indexing (LSI)
- Similar to factor analysis
- The number of dimensions can be chosen as we like
- We make some kind of projection from a vector space with all terms to the smaller dimensionality
- Each dimension is a mix of terms
  - Impossible to know the meaning of the dimension

LSI, cont.
- Distance between vectors is cosine just as before
- Meaningful to calculate distance between all terms and/or documents
- How can we do the projection? There are some ways:
  - Singular value decomposition
  - Random indexing (Magnus Sahlgren)
  - Neural nets, factor analysis, etc.

LSI, cont.
- I prefer LSI since:
  - Michael W. Berry 1992: “… This important result indicates that A_k is the best k-rank approximation (in a least squares sense) to the matrix A.”
  - Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.

SVD vs. Random indexing
- SVD
  - A mathematically complicated way (based on eigenvalues) to find an optimal vector space with a specific number of dimensions
  - Computationally heavy – maybe 20 hours for a one-million-document newspaper corpus
  - Often uses the entire document as context
- Random indexing
  - Select some dimensions randomly
  - Not as heavy to calculate, but it is less clear (to me) why it works
  - Uses a small context, typically 1+1 to 5+5 words

Random indexing
- Magnus will tell you more next year!
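Until then, a rough sketch of the idea as described on the previous slide, under assumed settings (the dimensionality, number of non-zero entries and window size below are illustrative choices, not Sahlgren's actual parameters): every word type gets a fixed sparse random index vector, and a word's context vector accumulates the index vectors of the words seen within a small window around it.

```python
import numpy as np
from collections import defaultdict

DIM, NONZERO, WINDOW = 300, 4, 2   # hypothetical settings

rng = np.random.default_rng(0)

def index_vector():
    """Sparse ternary random vector: a few +1/-1 entries, the rest zeros."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1, 1], size=NONZERO)
    return v

def train(tokens):
    index = defaultdict(index_vector)             # fixed random vector per word type
    context = defaultdict(lambda: np.zeros(DIM))  # accumulated context vectors
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                context[word] += index[tokens[j]]
    return context

vectors = train("the cat sat on the mat while the dog sat on the rug".split())
```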

A toy example to demonstrate SVD

How does SVD work?
- In the traditional vector model:
  - Relevant terms are the terms present in the document
  - The term “trees” seems relevant to the m-documents, but is not present in m4
  - The “relevant” relation is not transitive
  - cos(m1, m4) = 0, just as cos(c1, m3) = 0
- We get the following matrices from the SVD

What SVD gives us
- X = T0 S0 D0ᵀ, where X, T0, S0 and D0 are matrices (T0 holds the term vectors, D0 the document vectors, and S0 is a diagonal matrix with the singular values)

Using the SVD
- The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
- We can select m easily, just by using as many rows/columns of T0, S0 and D0 as we want
- To get an idea, let's use m = 2 and recalculate a new (approximated) X – it will still be a t × d matrix

We can recalculate X̂ with m = 2
[Table on the slide: the reconstructed term-by-document matrix X̂, with rows Human, Interface, Computer, User, System, Response, Time, EPS, Survey, Trees, Graph, Minors and columns C1–C5, M1–M4; the numerical values are not reproduced in this transcript.]

The ”new” X
- Less space in the vector space…
- Many fewer orthogonal vector pairs
- ”Trees” is now very relevant to M4
- M1 and M4 seem similar now!
- The “relevant” relation is a bit transitive
- Relevant terms for a document may, but need not, be present in the document
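A numerical sketch of the toy example, using numpy. The term-by-document matrix below is made up for illustration (the actual matrix from the talk is not reproduced in this transcript), but it shows the same effect: after a rank-2 reconstruction, documents that share no terms are no longer orthogonal, and a term gets weight in a document it never occurred in. A query can also be folded into the same reduced space.

```python
import numpy as np

terms = ["human", "interface", "user", "system", "trees", "graph", "minors"]
docs  = ["c1", "c2", "c3", "m1", "m2", "m3"]

# Made-up term-by-document count matrix X (rows = terms, columns = documents)
X = np.array([
    [1, 1, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0],   # interface
    [0, 1, 1, 0, 0, 0],   # user
    [1, 1, 1, 0, 0, 0],   # system
    [0, 0, 0, 1, 1, 0],   # trees
    [0, 0, 0, 0, 1, 1],   # graph
    [0, 0, 0, 0, 0, 1],   # minors
], dtype=float)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Full SVD: X = T0 @ diag(S0) @ D0t, the decomposition given above
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the m strongest dimensions and recompute the approximated X-hat
m = 2
X_hat = T0[:, :m] @ np.diag(S0[:m]) @ D0t[:m, :]

i, j = docs.index("m1"), docs.index("m3")
print("cos(m1, m3) in X:    ", cos(X[:, i], X[:, j]))        # 0.0, no shared terms
print("cos(m1, m3) in X-hat:", cos(X_hat[:, i], X_hat[:, j]))  # clearly > 0 after LSI
print("weight of 'trees' in m3 after LSI:", round(X_hat[terms.index("trees"), j], 3))

# "Folding in": project a new query vector q (full term space) into the
# m-dimensional space as q_hat = S_m^{-1} T_m^T q
q = np.array([0, 0, 0, 0, 1, 1, 0], dtype=float)   # a query: "trees graph"
q_hat = np.diag(1.0 / S0[:m]) @ T0[:, :m].T @ q
```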

Some applications
- Automatic generation of a domain-specific thesaurus
- Keyword extraction from documents
- Find sets of similar documents in a collection
- Find documents related to a given document or a set of terms

Problems and questions
- How can we interpret the similarities as different kinds of relations?
- How can we include document structure and phrases in the model?
- Terms are not really terms, but just words
- Ambiguous terms pollute the vector space
- How could we find the optimal number of dimensions for the vector space?

An example based on newspaper articles

Closest terms to ”stefan edberg”:
edberg 0.918, cincinnatis 0.887, edbergs 0.883, världsfemman 0.883, stefans 0.883, tennisspelarna 0.863, stefan 0.861, turneringsseger 0.859, queensturneringen, växjöspelaren 0.852, grästurnering 0.847

Closest terms to ”bengt johansson”:
johansson 0.852, johanssons 0.704, bengt 0.678, centerledare 0.674, miljöcentern 0.667, landsbygdscentern 0.667, implikationer 0.645, ickesocialistisk 0.643, centerledaren 0.627, regeringsalternativet, vagare 0.616

Bengt Johansson is just Bengt + Johansson – something is missing!

Closest terms to ”bengt”:
bengt 1.000, westerberg 0.912, folkpartiledaren 0.899, westerbergs 0.893, fpledaren 0.864, socialminister 0.862, försvarsfrågorna 0.860, socialministern 0.841, måndagsresor 0.840, bulldozer 0.838, skattesubventionerade, barnomsorgsgaranti 0.829

Closest terms to ”johansson”:
johansson 1.000, johanssons 0.800, olof 0.684, centerledaren 0.673, valperiod 0.668, centerledarens 0.654, betongpolitiken 0.650, downhill 0.640, centerfamiljen 0.635, centerinflytande 0.634, brokrisen 0.632, gödslet 0.628

A small experiment
- I want the model to know the difference between Bengt and Bengt Johansson
  1. Make a frequency list of all n-tuples up to n = 5 with frequency > 1
  2. Keep all words in the bags, but add the tuples, with spaces replaced by ”-”, as words
  3. Run the LSI again
- Now bengt-johansson is a word, and bengt-johansson is NOT Bengt + Johansson
- The number of terms grows a lot!
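A minimal sketch of steps 1–2 (the function names and the whitespace-tokenised toy corpus are mine; only the recipe, n-tuples up to n = 5 with frequency > 1, hyphenated and added alongside the original words, comes from the slide). The resulting bags would then be fed to the LSI run in step 3.

```python
from collections import Counter

def ngrams(tokens, max_n=5):
    """All contiguous n-tuples, 2 <= n <= max_n, as hyphenated strings."""
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield "-".join(tokens[i:i + n])

def add_tuples(documents, max_n=5, min_freq=2):
    """Step 1: count tuples over the whole corpus; step 2: keep the original
    words in each bag and add every tuple that occurs at least min_freq times."""
    freq = Counter(t for doc in documents for t in ngrams(doc, max_n))
    kept = {t for t, c in freq.items() if c >= min_freq}
    return [doc + [t for t in ngrams(doc, max_n) if t in kept] for doc in documents]

docs = [
    "bengt johansson tränar handbollslandslaget".split(),
    "förbundskaptenen bengt johansson var nöjd".split(),
]
print(add_tuples(docs)[0])   # the original words plus the added term 'bengt-johansson'
```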

And the top list for Bengt-Johansson
bengt-johansson 1.000, dubbellandskamperna, pettersson-sävehof 0.952, kristina-jönsson 0.950, fanns-svenska-glädjeämnen, johan-pettersson-sävehof, martinsson-karlskrona, förbundskaptenen-bengt-bengan-johansson, förbundskaptenen-bengt-bengan, sjumålsskytt 0.931, svenska-damhandbollslandslaget, stankiewicz 0.926, em-par 0.925, västeråslaget 0.923, jan-stankiewicz 0.923, handbollslandslag 0.922, bengt-johansson-tt 0.921, st-petersburg-sverige 0.921, petersburg-sverige 0.921, sjuklistan 0.920, olsson-givetvis 0.920, … johansson 0.567, bengt 0.354, olof 0.181, centerledaren 0.146, westerberg 0.061, folkpartiledaren 0.052

The new vector space model
- It is clear that it is now possible to find terms closely related to Bengt Johansson – the handball coach
- But is the model better for single words and for document comparison as well? What do you think?
- More “words” than before – hopefully it improves the result just as more data does
- At least no reason for a worse result... Or?

An example document
REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTAMINUTEN ÖVERLÄGGNINGAR OM BRON Under onsdagskvällen satt partiledarna i regeringen i sista minutenöverläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …
[In English: ”Government crisis or not: party leaders in last-minute talks about the bridge.” The party leaders in the government met on Wednesday evening over the Öresund bridge; Centre Party leader Olof Johansson was the first to leave the talks, and the government will give its answer on Thursday, which according to Prime Minister Carl Bildt need not mean a yes or a no to the bridge.]

Closest terms in each model (the two columns on the slide correspond to the two models)

0.986 underkänner, 0.982 irhammar, 0.977 partiledarna, 0.970 godkände, 0.962 delade-meningar, 0.960 regeringssammanträde, 0.957 riksdagsledamot, 0.957 bengt-westerberg, 0.954 materialet, 0.952 diskuterade, 0.950 folkpartiledaren, 0.949 medierna, 0.947 motsättningarna, 0.946 vilar, socialminister-bengt-westerberg

0.967 partiledarna, 0.921 miljökrav, 0.921 underkänner, 0.918 tolkar, 0.897 meningar, 0.888 centerledaren, 0.886 regeringssammanträde, 0.880 slottet, 0.880 rosenbad, 0.877 planminister, 0.866 folkpartiledaren, 0.855 thurdin, 0.845 brokonsortiet, 0.839 görel, 0.826 irhammar

Closest document in both models
BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …
[In English: ”Bildt promises an answer as a government crisis looms.” There will be an answer on Thursday, but it need not mean a yes or a no to the Öresund bridge project; the tone between Bildt and environment minister Olof Johansson has sharpened, hinting at an approaching government crisis, with Bildt claiming the Centre Party spreads skewed information, while Johansson and planning minister Görel Thurdin argue the government can only say no if the environmental assessment is properly taken into account.]

[Table on the slide comparing, for each document, its score and rank in the basic model and in the model with tuples added; the numbers are not reproduced in this transcript.]

Documents with better ranking in the basic model

BRON KAN BLI VALFRÅGA SÄGER JOHANSSON Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …
[”The bridge may become an election issue, says Johansson.”]

INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …
[”Not one critical voice among the Centre Party members about the bridge decision.”]

Documents with better ranking in the tuple model

ALF SVENSSON TOPPNAMN I STOCKHOLM Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …
[”Alf Svensson top name in Stockholm.”]

BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …
[”Bengt Westerberg: child pornography must be stopped.”]

Hmm, adding n-grams was maybe too simple...
1. If the bad result is due to overtraining, it could help to remove the words I build phrases from…
2. Another way to try is to use a dependency parser to find more meaningful phrases, not just n-grams
A new test following 1 above:
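A minimal sketch of what option 1 could look like (the helper and variable names are mine): wherever a kept tuple occurs, the hyphenated tuple replaces its component words, so those words no longer show up as separate terms at that position.

```python
def replace_tuples(tokens, kept_tuples, max_n=5):
    """Greedy longest-match replacement: where a kept tuple occurs, emit the
    hyphenated tuple and skip its component words."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            cand = "-".join(tokens[i:i + n])
            if cand in kept_tuples:
                out.append(cand)
                i += n
                break
        else:                      # no kept tuple starts here: keep the single word
            out.append(tokens[i])
            i += 1
    return out

tokens = "förbundskaptenen bengt johansson var nöjd".split()
print(replace_tuples(tokens, {"bengt-johansson"}))
# ['förbundskaptenen', 'bengt-johansson', 'var', 'nöjd']
```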

Ok, the words inside tuples are now removed
bengt-johansson 1.000, tomas-svensson 0.931, sveriges-handbollslandslag, förbundskapten-bengt-johansson, handboll 0.897, svensk-handboll 0.896, handbollsem 0.894, carlen 0.883, lagkaptenen-carlen 0.869, förbundskapten-johansson, ola-lindgren 0.863, bengan-johansson 0.862, mats-olsson 0.854, carlen-magnus-wislander, handbollens 0.851, magnus-andersson 0.851, halvlek-svenskarna 0.849, teka-santander 0.849, storskyttarna 0.849, förbundskaptenen-bengt-johansson, målvakten-mats-olsson, danmark-tvåa 0.843, handbollsspelare 0.839, sveriges-handbollsherrar 0.836

And now pseudo documents are added for each tuple
bengt-johansson 1.000, förbundskapten-bengt-johansson, förbundskaptenen-bengt-johansson, jonas-johansson 0.816, förbundskapten-johansson, johanssons 0.795, svenske-förbundskaptenen-bengt-johansson, bengan 0.786, carlen 0.777, bengan-johansson 0.767, johansson-andreas-dackell, förlorat-matcherna 0.750, ck-bure 0.748, daniel-johansson 0.748, målvakten-mats-olsson, jörgen-jönsson-mikael-johansson, kicki-johansson 0.744, mattias-johansson-aik, thomas-johansson 0.739, handbollsnation 0.738, mikael-johansson 0.737, förbundskaptenen-bengt-johansson-valden, johansson-mats-olsson, sveriges-handbollslandslag, ställningen-33-matcher 0.736

Evaluation – are the new models any better?
- The different approaches may be evaluated using a trained vector model and:
  - A typical IR test suite of queries, documents, and relevance information
  - Texts with lists of manually selected keywords (multi-word units included)
  - The Test of English as a Foreign Language (TOEFL), which tests the ability to select synonyms from a set of alternatives
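As an illustration of the TOEFL-style evaluation (the scoring rule below, pick the alternative with the highest cosine to the target word, is the usual way such synonym tests are answered with vector models; the words and the three-dimensional vectors are made up):

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toefl_item(target, alternatives, vectors):
    """Answer one synonym item: choose the alternative closest to the target."""
    return max(alternatives, key=lambda w: cos(vectors[target], vectors[w]))

# Hypothetical vectors standing in for a trained model
vectors = {
    "levied":     np.array([0.9, 0.1, 0.2]),
    "imposed":    np.array([0.8, 0.2, 0.1]),
    "believed":   np.array([0.1, 0.9, 0.3]),
    "requested":  np.array([0.3, 0.4, 0.8]),
    "correlated": np.array([0.2, 0.7, 0.6]),
}
print(toefl_item("levied", ["imposed", "believed", "requested", "correlated"], vectors))
# -> 'imposed'
```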

What next?
- Joakim Nivre is developing a shallow dependency parser in Växjö
- Selected dependencies could be used instead of raw tuples
- It is much faster than tuple counting!
- Will I get rid of the overtraining effects?
- A separate problem: I don't have a fully working LSI/SVD package

Other interesting questions/tasks
- Understand what similarity means in a vector space model
- Try to interpret various relations from similarities in a vector space model
- Try to solve the ”optimal number of dimensions” problem
- Explore what the length of the vectors means

The End!
Questions?