Vector-based Semantic Analysis. Leif Grönqvist, Växjö University. Stockholm, December 2003.

1 Vector-based Semantic Analysis
Leif Grönqvist
Växjö University (Mathematics and Systems Engineering)
GSLT (Graduate School of Language Technology)
Göteborg University (Department of Linguistics)

2 Outline of the talk
- My background and research interests
- Vector space models in IR
  - The traditional model
  - Latent semantic indexing (LSI)
    - Singular value decomposition (SVD)
- Weaknesses in the model and how to avoid or get rid of them
- Experiments – extended LSI

3 My background
- 1986-1989: “4-årig teknisk” (four-year technical upper secondary programme, electrical engineering)
- 1989-1993: M.Sc. (official translation of “Filosofie Magister”) in Computing Science, Göteborg University
- 1989-1993: 62 points in mechanics, electronics, etc.
- 1994-2001: Work at the Department of Linguistics in Göteborg
  - Various projects related to corpus linguistics
  - Some teaching on statistical methods (Göteborg and Uppsala) and on corpus linguistics (Göteborg, Sofia, Beijing, South Africa)
- 1995: Consultant at Redwood Research in Sollentuna, working on information retrieval in medical databases
- 1995-1996: Work at the Department of Informatics in Göteborg (the Internet Project)
- 2001-2006: PhD student in Computer Science / Language Technology in Växjö

4 Research interests
- Statistical methods in computational linguistics
  - Corpus linguistics
  - Hidden Markov models
  - Tagging, parsing, etc.
  - Machine learning
- Information retrieval
  - Vector space models containing semantic information
    - Co-occurrence statistics
    - LSI
    - Adding more linguistic information
  - Clustering
- Finite state technology

5 My thesis – purpose
- Will be finished in 2006
“The major goal of this investigation is to improve the vector model obtained by LSI, using linguistic information that could be extracted automatically from raw textual data. The kind of improvements in mind for specific applications are:
- Give a search engine the capability to disambiguate ambiguous words, and between different persons with the same name.
- Make it possible for a keyword extractor to find not just words, but a list of relevant multi-word units, phrases and words.”

6 My thesis – purpose, cont.
“The starting point is that multi-word units in the vector model could give the improvements above, since the model as it is includes only words, not phrases. How this information should be added is an open question. Two possible ways would be to:
- Insert tuples/collocations extracted by some kind of statistics, for example based on entropy
- Use a shallow dependency parser
How this should be done is not at all clear, so many experiments will be needed. However, it is important to use extremely fast algorithms. At least one billion words should be possible to prepare in reasonable time (the magnitude of one day), which will limit the possible ways to add these phrases.”

7 My thesis – purpose, cont.
“The different approaches may be evaluated using a trained vector model and:
- A typical IR test suite of queries, documents, and relevance information
- Texts with lists of manually selected keywords (multiword units included)
- The Test of English as a Foreign Language (TOEFL), which tests the ability of selecting synonyms from a set of alternatives
An improved model could improve applications like: traditional information retrieval, keyword extraction, automatic thesaurus construction.”

8 The traditional vector model
- One dimension for each index term
- A document is a vector in a very high dimensional space
- The similarity between a document $d$ and a query $q$ is the cosine of the angle between their vectors: $\mathrm{sim}(d, q) = \frac{d \cdot q}{\|d\|\,\|q\|}$
- This gives us a degree of similarity instead of the yes/no of basic keyword search

9 The traditional vector model, cont.
- Assumption used: all terms are unrelated
- Could be fixed partially using different weights for each term
- Still, we have a lot more dimensions than we want
  - How should we decide the index terms?
  - Similarity between terms is always 0
  - Very similar documents may have sim ≈ 0 if they:
    - use a different vocabulary
    - don’t use the index terms
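The last point can be illustrated with a minimal sketch, assuming cosine similarity over raw term counts (a later slide says the distance is cosine). The function name `cosine` and the two toy documents are made up for illustration and are not from the talk.

```python
import math

def cosine(doc, query):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * query.get(term, 0.0) for term, w in doc.items())
    norm_doc = math.sqrt(sum(w * w for w in doc.values()))
    norm_query = math.sqrt(sum(w * w for w in query.values()))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_doc * norm_query)

# Two hypothetical documents about the same topic with disjoint vocabularies.
doc_a = {"car": 1.0, "engine": 1.0}
doc_b = {"automobile": 1.0, "motor": 1.0}

same_topic_similarity = cosine(doc_a, doc_b)   # no shared index terms
self_similarity = cosine(doc_a, doc_a)
```

Because every index term is its own orthogonal dimension, the two documents share no non-zero coordinate and their similarity is exactly zero, however close they are in meaning.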

10 Latent semantic indexing (LSI)
- Similar to factor analysis
- The number of dimensions can be chosen as we like
- We make some kind of projection from a vector space with all terms to a space of lower dimensionality
- Each dimension is a mix of terms
- Impossible to know the meaning of a dimension

11 LSI, cont.
- Distance between vectors is cosine, just as before
- It is meaningful to calculate the distance between all terms and/or documents
- How can we do the projection? There are several ways:
  - Singular value decomposition
  - Random indexing (Magnus Sahlgren)
  - Neural nets, factor analysis, etc.

12 LSI, cont.
- I prefer LSI since:
  - Michael W Berry 1992: “… This important result indicates that A_k is the best k-rank approximation (in a least squares sense) to the matrix A.”
  - Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
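Berry's statement is usually written as below (the Eckart–Young theorem). The symbols $U$, $\Sigma$, $V$, $\sigma_i$, $u_i$, $v_i$, $r$ and $B$ are not on the slide and are introduced here only for this sketch; $A$ and $A_k$ are Berry's.

```latex
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0,
\qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top},

\| A - A_k \|_F
  = \min_{\operatorname{rank}(B) \le k} \| A - B \|_F
  = \sqrt{\sum_{i=k+1}^{r} \sigma_i^{2}} .
```

The optimality is stated for the least-squares (Frobenius) error over the matrix entries. Reading it as "keeps distances in the best possible way" is the speaker's interpretation.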

13 SVD vs. Random indexing
- SVD
  - A mathematically complicated (based on eigenvalues) way to find an optimal vector space in a specific number of dimensions
  - Computationally heavy: maybe 20 hours for a newspaper corpus of one million documents
  - Often uses the entire document as context
- Random indexing
  - Select some dimensions randomly
  - Not as heavy to calculate, but less clear (to me) why it works
  - Uses a small context, typically from 1+1 to 5+5 words
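The slide gives no details of random indexing, so the following is only a sketch of one common formulation, not necessarily the variant meant in the talk. Each word gets a sparse random "index vector" with a few +1/-1 entries, and a word's context vector is the sum of the index vectors of its neighbours inside a small window. The function names, `dim`, `nonzeros` and `window` are all assumptions made for illustration.

```python
import numpy as np

def random_index_vectors(vocab, dim=100, nonzeros=4, seed=0):
    """Give each word a sparse random vector with `nonzeros` entries of +1 or -1."""
    rng = np.random.default_rng(seed)
    index = {}
    for word in vocab:
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzeros, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        index[word] = vec
    return index

def context_vectors(tokens, index, window=2):
    """Sum the index vectors of the neighbours within `window` words on each side."""
    dim = len(next(iter(index.values())))
    context = {word: np.zeros(dim) for word in index}
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                context[word] += index[tokens[j]]
    return context

# Tiny made-up token stream with a 1+1 word window.
tokens = ["a", "b", "c"]
index = random_index_vectors(sorted(set(tokens)), dim=20, nonzeros=4, seed=1)
context = context_vectors(tokens, index, window=1)
```

Nothing is decomposed here: the dimensionality is fixed in advance, and the vectors are built in a single pass over the text, which is why the method is cheap compared with SVD.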

14 Random indexing
- Magnus will tell you more next year!

15 A toy example to demonstrate SVD

16 How does SVD work?
- In the traditional vector model:
  - Relevant terms are the terms present in the document
  - The term “trees” seems relevant to the m-documents, but is not present in m4
  - The “relevant” relation is not transitive
  - cos(m1,m4)=0 just as cos(c1,m3)=0
- We get the following matrices from the SVD

17 What SVD gives us
$X = T_0 S_0 D_0^{\top}$, where $X$, $T_0$, $S_0$ and $D_0$ are matrices
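A minimal sketch of the decomposition with NumPy, assuming a made-up 4-term by 3-document count matrix (the numbers are hypothetical). The variable names `T0`, `s0` and `D0t` follow the slide's notation. `numpy.linalg.svd` returns the singular values as a 1-D array and the document matrix already transposed.

```python
import numpy as np

# Hypothetical term-by-document matrix: 4 terms (rows), 3 documents (columns).
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Reduced ("economy") SVD: T0 is 4x3, s0 has 3 singular values, D0t is 3x3.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

# Multiplying the three factors back together recovers X up to rounding.
reconstructed = T0 @ S0 @ D0t
```

The columns of `T0` and the rows of `D0t` are orthonormal, and the singular values come out sorted from largest to smallest, which is what makes truncation to the first m of them straightforward.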

18 Using the SVD
- The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
- We can select m easily just by using as many rows/columns of $T_0$, $S_0$, $D_0$ as we want
- To get an idea, let’s use m=2 and recalculate a new (approximated) X; it will still be a t × d matrix

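19 We can recalculate X̂ with m=2
[Table: the rank-2 approximation X̂. Columns: C1, C2, C3, C4, C5, M1, M2, M3, M4. Rows: Human, Interface, Computer, User, System, Response, Time, EPS, Survey, Trees, Graph, Minors. Most values were lost in extraction. Surviving fragments: System .45 1.23 …; Trees -.06 .23 -.14 …; Graph -.06 .34 -.15 …; Minors -.04 .25 -.10 …]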

20 The “new” X
- Less space in the vector space…
- Far fewer orthogonal vector pairs
- “Trees” is now very relevant to M4
- M1 and M4 seem similar now!
- The “relevant” relation is a bit transitive
- Relevant terms for a document may, but need not, be present in the document
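These observations can be checked with a short sketch. It assumes that the toy example is the well-known nine-title matrix from Deerwester et al. (1990), whose term and document labels match the slide. The matrix below is typed in from that paper, not from the talk, and the helper names `truncate` and `cosine` are made up here.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-by-document counts (assumed: the Deerwester et al. 1990 example).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

def truncate(matrix, m):
    """Rebuild the matrix from its m largest singular values and vectors."""
    T0, s0, D0t = np.linalg.svd(matrix, full_matrices=False)
    return T0[:, :m] @ np.diag(s0[:m]) @ D0t[:m, :]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X2 = truncate(X, 2)  # still a 12 x 9 (t x d) matrix

trees, system = terms.index("trees"), terms.index("system")
m1, m4, c1, c2 = (docs.index(d) for d in ("m1", "m4", "c1", "c2"))

trees_in_m4_before, trees_in_m4_after = X[trees, m4], X2[trees, m4]
sim_before = cosine(X[:, m1], X[:, m4])
sim_after = cosine(X2[:, m1], X2[:, m4])
```

In the original matrix "trees" has weight 0 in m4 and the m1 and m4 columns are orthogonal. After truncation to m=2 both quantities become clearly positive, and the values for System in columns C1 and C2 can be compared with the fragments that survive on the slide.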

21 Some applications
- Automatic generation of a domain specific thesaurus
- Keyword extraction from documents
- Find sets of similar documents in a collection
- Find documents related to a given document or a set of terms

22 Problems and questions
- How can we interpret the similarities as different kinds of relations?
- How can we include document structure and phrases in the model?
- Terms are not really terms, but just words
- Ambiguous terms pollute the vector space
- How could we find the optimal number of dimensions for the vector space?

23 An example based on 50 000 newspaper articles
- Closest terms to “stefan edberg”: edberg 0.918, cincinnatis 0.887, edbergs 0.883, världsfemman 0.883, stefans 0.883, tennisspelarna 0.863, stefan 0.861, turneringsseger 0.859, queensturneringen 0.858, växjöspelaren 0.852, grästurnering 0.847
- Closest terms to “bengt johansson”: johansson 0.852, johanssons 0.704, bengt 0.678, centerledare 0.674, miljöcentern 0.667, landsbygdscentern 0.667, implikationer 0.645, ickesocialistisk 0.643, centerledaren 0.627, regeringsalternativet 0.620, vagare 0.616

24 Bengt Johansson is just Bengt + Johansson – something is missing!
- Closest terms to “bengt”: bengt 1.000, westerberg 0.912, folkpartiledaren 0.899, westerbergs 0.893, fpledaren 0.864, socialminister 0.862, försvarsfrågorna 0.860, socialministern 0.841, måndagsresor 0.840, bulldozer 0.838, skattesubventionerade 0.833, barnomsorgsgaranti 0.829
- Closest terms to “johansson”: johansson 1.000, johanssons 0.800, olof 0.684, centerledaren 0.673, valperiod 0.668, centerledarens 0.654, betongpolitiken 0.650, downhill 0.640, centerfamiljen 0.635, centerinflytande 0.634, brokrisen 0.632, gödslet 0.628

25 A small experiment
- I want the model to know the difference between Bengt and Bengt
  1. Make a frequency list of all n-tuples up to n=5 with a frequency > 1
  2. Keep all words in the bags, but add the tuples as words, with spaces replaced by -
  3. Run the LSI again
- Now bengt-johansson is a word, and bengt-johansson is NOT Bengt + Johansson
- The number of terms grows a lot!
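Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiment: tuples are taken to be contiguous word n-grams with n from 2 to 5, counted over the whole collection, and the function names and the three toy documents are made up.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous n-word tuples in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def add_tuples(docs, max_n=5, min_freq=2):
    """Return one bag per document: its words plus frequent tuples joined by '-'."""
    # Step 1: frequency list of all n-tuples (2 <= n <= max_n) in the collection.
    counts = Counter()
    for words in docs:
        for n in range(2, max_n + 1):
            counts.update(ngrams(words, n))
    frequent = {gram for gram, count in counts.items() if count >= min_freq}

    # Step 2: keep every word and append each frequent tuple as a new "word".
    bags = []
    for words in docs:
        bag = list(words)
        for n in range(2, max_n + 1):
            bag.extend("-".join(gram) for gram in ngrams(words, n) if gram in frequent)
        bags.append(bag)
    return bags

# Hypothetical tokenised documents.
docs = [
    ["bengt", "johansson", "leder", "laget"],
    ["bengt", "johansson", "vann"],
    ["bengt", "westerberg"],
]
bags = add_tuples(docs)
```

Step 3 would then build the term-by-document matrix from `bags` instead of the raw word lists. Since every frequent tuple becomes a new row, the vocabulary grows quickly, as the slide notes.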

26 And the top list for Bengt-Johansson
bengt-johansson 1.000, dubbellandskamperna 0.954, pettersson-sävehof 0.952, kristina-jönsson 0.950, fanns-svenska-glädjeämnen 0.945, johan-pettersson-sävehof 0.942, martinsson-karlskrona 0.938, förbundskaptenen-bengt-bengan-johansson 0.932, förbundskaptenen-bengt-bengan 0.932, sjumålsskytt 0.931, svenska-damhandbollslandslaget 0.928, stankiewicz 0.926, em-par 0.925, västeråslaget 0.923, jan-stankiewicz 0.923, handbollslandslag 0.922, bengt-johansson-tt 0.921, st-petersburg-sverige 0.921, petersburg-sverige 0.921, sjuklistan 0.920, olsson-givetvis 0.920, …, johansson 0.567, bengt 0.354, olof 0.181, centerledaren 0.146, westerberg 0.061, folkpartiledaren 0.052

27 The new vector space model
- It is clear that it is now possible to find terms closely related to Bengt Johansson, the handball coach
- But is the model better for single words and for document comparison as well? What do you think?
- More “words” than before; hopefully this improves the result just as more data does
- At least there is no reason for a worse result... Or?

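28 An example document
Swedish original (corpus text, without punctuation): REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTAMINUTEN ÖVERLÄGGNINGAR OM BRON Under onsdagskvällen satt partiledarna i regeringen i sista minutenöverläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …
English translation: GOVERNMENT CRISIS OR NOT: PARTY LEADERS IN LAST-MINUTE TALKS ABOUT THE BRIDGE. On Wednesday evening the party leaders in the government sat in last-minute talks about the Öresund Bridge. Centre Party leader Olof Johansson was the first to leave the talks. On Thursday the government is to give an answer. According to Prime Minister Carl Bildt, however, that need not mean a yes or a no to the bridge …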

29 Closest terms in each model
- First list (includes hyphenated tuples): underkänner 0.986, irhammar 0.982, partiledarna 0.977, godkände 0.970, delade-meningar 0.962, regeringssammanträde 0.960, riksdagsledamot 0.957, bengt-westerberg 0.957, materialet 0.954, diskuterade 0.952, folkpartiledaren 0.950, medierna 0.949, motsättningarna 0.947, vilar 0.946, socialminister-bengt-westerberg 0.944
- Second list (single words only): partiledarna 0.967, miljökrav 0.921, underkänner 0.921, tolkar 0.918, meningar 0.897, centerledaren 0.888, regeringssammanträde 0.886, slottet 0.880, rosenbad 0.880, planminister 0.877, folkpartiledaren 0.866, thurdin 0.855, brokonsortiet 0.845, görel 0.839, irhammar 0.826

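30 Closest document in both models
Swedish original (corpus text, without punctuation): BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …
English translation: BILDT PROMISES AN ANSWER AND A GOVERNMENT CRISIS LOOMS. There will be an answer on Thursday, but it need not mean a yes or a no from the government to the Öresund Bridge project. Prime Minister Carl Bildt made this promise on Wednesday in a number of variants. At the same time the tone sharpened between him and Environment Minister Olof Johansson, and the mood pointed to an approaching government crisis. During the long bridge process the two have avoided speaking critically of each other and thereby escalating the tensions. But now Bildt says that the Centre Party is giving out skewed information. Johansson and Planning Minister Görel Thurdin, on the other hand, hold that the government can only say no to the bridge if proper account is taken of the material for the environmental review …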

31 [Table: score and rank of each document in the basic model and in the model with tuples added]
Doc     Basic model       Tuples added
        Score   Rank      Score   Rank
2126    1.000   1         …       1
2127    .996    2         .999    2
2128    .848    5         .677    3
3767    .849    3         .534    7
211     .805    8         .526    8
156     .844    6         .525    9
215     .805    9         .522    10
2602    .848    4         .492    12
2367    .804    10        .434    19
2360    .838    7         .402    23
3481    .527    53        .673    4
1567    .456    73        .601    5
1371    .456    73        .601    5

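32 Documents with better ranking in the basic model
- Doc 2602 (basic model: score .848, rank 4; tuples added: score .492, rank 12): BRON KAN BLI VALFRÅGA SÄGER JOHANSSON Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …
  English translation: THE BRIDGE MAY BECOME AN ELECTION ISSUE, SAYS JOHANSSON. If the government is leaning towards a yes for political reasons, then this question is of course an important election issue …
- Doc 2367 (basic model: score .804, rank 10; tuples added: score .434, rank 19): INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …
  English translation: NOT ONE CRITICAL VOICE AMONG CENTRE PARTY MEMBERS ABOUT THE BRIDGE DECISION. A partial victory for the environment and the Centre Party. Praise for Olof Johansson, Görel Thurdin and Carl Bildt …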

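33 Documents with better ranking in the tuple model
- Doc 1567 (basic model: score .456, rank 73; tuples added: score .601, rank 5): ALF SVENSSON TOPPNAMN I STOCKHOLM Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …
  English translation: ALF SVENSSON TOP NAME IN STOCKHOLM. KDS leader Alf Svensson heads the KDS parliamentary list for the city of Stockholm, and Michael Stjernström, an expert adviser in the Prime Minister's Office, has an electable second place …
- Doc 1371 (basic model: score .456, rank 74; tuples added: score .601, rank 6): BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …
  English translation: BENGT WESTERBERG: CHILD PORNOGRAPHY MUST BE STOPPED. Liberal Party leader Bengt Westerberg promised on Wednesday that the government will do everything to stop child pornography …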

34 Hmm, adding n-grams was maybe too simple...
1. If the bad result is due to overtraining, it could help to remove the words I build phrases from…
2. Another thing to try is to use a dependency parser to find more meaningful phrases, not just n-grams
A new test following 1 above:

35 Ok, the words inside tuples are now removed
bengt-johansson 1.000, tomas-svensson 0.931, sveriges-handbollslandslag 0.912, förbundskapten-bengt-johansson 0.898, handboll 0.897, svensk-handboll 0.896, handbollsem 0.894, carlen 0.883, lagkaptenen-carlen 0.869, förbundskapten-johansson 0.863, ola-lindgren 0.863, bengan-johansson 0.862, mats-olsson 0.854, carlen-magnus-wislander 0.852, handbollens 0.851, magnus-andersson 0.851, halvlek-svenskarna 0.849, teka-santander 0.849, storskyttarna 0.849, förbundskaptenen-bengt-johansson 0.845, målvakten-mats-olsson 0.845, danmark-tvåa 0.843, handbollsspelare 0.839, sveriges-handbollsherrar 0.836

36 And now pseudo documents are added for each tuple
bengt-johansson 1.000, förbundskapten-bengt-johansson 0.907, förbundskaptenen-bengt-johansson 0.835, jonas-johansson 0.816, förbundskapten-johansson 0.799, johanssons 0.795, svenske-förbundskaptenen-bengt-johansson 0.792, bengan 0.786, carlen 0.777, bengan-johansson 0.767, johansson-andreas-dackell 0.765, förlorat-matcherna 0.750, ck-bure 0.748, daniel-johansson 0.748, målvakten-mats-olsson 0.747, jörgen-jönsson-mikael-johansson 0.744, kicki-johansson 0.744, mattias-johansson-aik 0.741, thomas-johansson 0.739, handbollsnation 0.738, mikael-johansson 0.737, förbundskaptenen-bengt-johansson-valden 0.736, johansson-mats-olsson 0.736, sveriges-handbollslandslag 0.736, ställningen-33-matcher 0.736

37 Evaluation – are the new models any better?
- The different approaches may be evaluated using a trained vector model and:
  - A typical IR test suite of queries, documents, and relevance information
  - Texts with lists of manually selected keywords (multiword units included)
  - The Test of English as a Foreign Language (TOEFL), which tests the ability of selecting synonyms from a set of alternatives
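The synonym test could be scored roughly as sketched below, assuming each item consists of a probe word, a list of alternatives and the correct answer, and that the model simply picks the alternative whose vector has the highest cosine with the probe. The function names, the item format and the two-dimensional toy vectors are hypothetical; real vectors would come from the trained model.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_synonym(vectors, probe, alternatives):
    """Choose the alternative closest to the probe word in the vector space."""
    return max(alternatives, key=lambda word: cosine(vectors[probe], vectors[word]))

def accuracy(vectors, items):
    """Fraction of (probe, alternatives, answer) items answered correctly."""
    correct = sum(pick_synonym(vectors, probe, alts) == answer
                  for probe, alts, answer in items)
    return correct / len(items)

# Made-up two-dimensional term vectors, for illustration only.
vectors = {
    "big":   np.array([1.0, 0.1]),
    "large": np.array([0.9, 0.2]),
    "small": np.array([-1.0, 0.0]),
    "green": np.array([0.0, 1.0]),
}
items = [("big", ["large", "small", "green"], "large")]

choice = pick_synonym(vectors, "big", ["large", "small", "green"])
score = accuracy(vectors, items)
```

The same accuracy figure can be computed for the basic model and for each tuple-extended model, which gives one number per model to compare.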

38 What next?
- Joakim Nivre is developing a shallow dependency parser in Växjö
- Selected dependencies could be used instead of raw tuples
- It is much faster than tuple counting!
- Will I get rid of the overtraining effects?
- A separate problem: I don’t have a fully working LSI/SVD package

39 Other interesting questions/tasks
- Understand what similarity means in a vector space model
- Try to interpret various relations from similarities in a vector space model
- Try to solve the “optimal number of dimensions” problem
- Explore what the length of the vectors means

40 The End!
- Questions?
