
Latent Semantic Indexing and Beyond. Leif Grönqvist, School of Mathematics and Systems Engineering. NoDaLiDa 2003, Friday 30 May 2003.


1 Latent Semantic Indexing and Beyond
Leif Grönqvist (lgr@msi.vxu.se)
School of Mathematics and Systems Engineering
The Swedish Graduate School of Language Technology
NoDaLiDa 2003, Friday 30 May 2003

2 What is Latent Semantic Indexing?
LSI uses a kind of vector model
The classical IR vector model groups documents with many terms in common
But:
– Documents could have very similar content while using different vocabularies
– The terms used in a document may not be the most representative ones
LSI uses the distribution of all terms in all documents when comparing two documents!

3 A traditional vector model for IR
The starting point is a term-document matrix, both for the traditional vector model and for LSI
We can calculate similarities between terms or documents using the cosine
We can also (trivially) find relevant terms for a document
Problems:
– The term “trees” seems relevant to the m-documents, but is not present in m4
– cos(c1,c5)=0 just as cos(c1,m3)=0
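A minimal sketch of the cosine comparison in a term-document matrix. The matrix below is a hypothetical four-term, three-document example, not the toy example from the slides; it only illustrates that two documents with no term in common always get a cosine of 0 in the traditional model.

```python
import numpy as np

# Hypothetical count matrix: rows are terms, columns are documents.
X = np.array([
    [1.0, 0.0, 0.0],  # term "human"
    [1.0, 1.0, 0.0],  # term "computer"
    [0.0, 1.0, 0.0],  # term "user"
    [0.0, 0.0, 1.0],  # term "trees"
])

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

d1, d2, d3 = X[:, 0], X[:, 1], X[:, 2]
sim_d1_d2 = cosine(d1, d2)  # the documents share the term "computer"
sim_d1_d3 = cosine(d1, d3)  # no shared term, so the cosine is 0
```

Term similarities work the same way, using rows of X instead of columns.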

4 A toy example

5 How does LSI work?
The idea is to try to use latent information like:
– word1 and word2 are often found together, so maybe doc1 (containing word1) and doc2 (containing word2) are related?
– doc3 and doc4 have many words in common, so maybe the words they don’t have in common are related?

6 How does LSI work? cont’d
In the classical vector model, a document vector (from our toy example) is 12-dimensional and the term vectors are 9-dimensional
What we want to do is to project these vectors into a vector space with lower dimensionality
One way is to use Singular Value Decomposition (SVD)
We decompose the original matrix into three new matrices

7 What SVD gives us
X = T_0 S_0 D_0, where X, T_0, S_0 and D_0 are matrices
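A small sketch of the decomposition using NumPy, on a hypothetical random 12 × 9 count matrix (the same shape as the toy example, but not its data). Note that `numpy.linalg.svd` returns the third factor already transposed, so the product that rebuilds X is T0 · S0 · D0ᵀ; the variable names are chosen here to mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical term-document counts: 12 terms (rows) x 9 documents (columns).
X = rng.integers(0, 3, size=(12, 9)).astype(float)

# Thin SVD: T0 is 12 x 9, s0 holds the 9 singular values, D0t is 9 x 9.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)           # singular values on the diagonal, largest first

X_rebuilt = T0 @ S0 @ D0t  # multiplying the three factors gives X back
```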

8 Using the SVD
The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
We can select m easily, just by using as many rows/columns of T_0, S_0, D_0 as we want
To get an idea, let’s use m=2 and recalculate a new (approximated) X; it will still be a t × d matrix
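The truncation step can be sketched as follows, assuming NumPy and a hypothetical 4 × 6 matrix built from two unrelated "topics". Documents d1 and d3 share no term, but both overlap with d2; after keeping m=2 dimensions they end up pointing in the same direction, while d1 and d4 (different topics) stay unrelated.

```python
import numpy as np

# Hypothetical data: terms a, b occur only in d1-d3; terms c, d only in d4-d6.
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # term a
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # term b
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],  # term c
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],  # term d
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

m = 2
T, S, Dt = T0[:, :m], np.diag(s0[:m]), D0t[:m, :]  # keep the m largest singular values

X_hat = T @ S @ Dt       # approximated X, still t x d
doc_vecs = (S @ Dt).T    # one m-dimensional vector per document
term_vecs = T @ S        # one m-dimensional vector per term

before = cosine(X[:, 0], X[:, 2])        # d1 vs d3 in the original space: 0
after = cosine(doc_vecs[0], doc_vecs[2]) # d1 vs d3 in the reduced space: close to 1
other = cosine(doc_vecs[0], doc_vecs[3]) # d1 vs d4: still close to 0
```

Scaling the document vectors by S is one common convention; other scalings are used as well, so treat this choice as an assumption.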

9 We can recalculate X with m=2
             C1    C2    C3    C4    C5    M1    M2    M3    M4
Human       .16   .40   .38   .47   .18  -.05  -.12  -.16  -.09
Interface   .14   .37   .33   .40   .16  -.03  -.07  -.10  -.04
Computer    .15   .51   .36   .41   .24   .02   .06   .09   .12
User        .26   .84   .61   .70   .39   .03   .08   .12   .19
System      .45  1.23  1.05  1.27   .56  -.07  -.15  -.21  -.05
Response    .16   .58   .38   .42   .28   .06   .13   .19   .22
Time        .16   .58   .38   .42   .28   .06   .13   .19   .22
EPS         .22   .55   .51   .63   .24  -.07  -.14  -.20  -.11
Survey      .10   .53   .23   .21   .27   .14         .44   .42
Trees      -.06   .23  -.14  -.27   .14   .24         .77   .66
Graph      -.06   .34  -.15  -.30   .20   .31         .98   .85
Minors     -.04   .25  -.10  -.21   .15   .22         .71   .62

10 What does the SVD give?
Susan Dumais 1995: “The SVD program takes the ltc transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix.”
Michael W Berry 1992: “This important result indicates that A_k is the best k-rank approximation (in a least squares sense) to the matrix A.”
Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
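A numerical illustration of the "best k-rank approximation in a least squares sense" statement, assuming NumPy and a hypothetical random matrix. The least squares error is measured with the Frobenius norm; for the truncated SVD it equals the root of the summed squares of the discarded singular values, and any other rank-k matrix (here a projection onto k random directions, chosen only for comparison) cannot do better.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((12, 9))  # hypothetical data, not from the slides
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_svd = np.linalg.norm(A - A_k)           # Frobenius norm of the residual
err_expected = np.sqrt(np.sum(s[k:] ** 2))  # what the discarded singular values predict

# Some other rank-k approximation: project A onto k random orthonormal directions.
Q, _ = np.linalg.qr(rng.standard_normal((12, k)))
A_other = Q @ (Q.T @ A)
err_other = np.linalg.norm(A - A_other)     # never smaller than err_svd
```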

11 Algorithms for dimensional reduction
Singular Value Decomposition (SVD)
– A mathematically complicated way (based on eigenvalues) to find an optimal vector space with a specific number of dimensions
– Computationally heavy: maybe 20 hours for a newspaper corpus of one million documents
– Often uses the entire document as context
Random Indexing (RI)
– Select some dimensions randomly
– Not as heavy to calculate, but less clear (to me) why it works
– Uses a small context, typically 1+1 to 5+5 words
Neural nets, Hyperspace Analogue to Language, etc.
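A rough sketch of Random Indexing as it is commonly described, which may differ from the variant the slide has in mind: every word gets a sparse random index vector, and a word's context vector is the sum of the index vectors of its neighbours inside a small window. The function name and the parameters `dim`, `nonzeros` and `window` are assumptions made for this illustration.

```python
import numpy as np

def random_indexing(tokens, dim=300, nonzeros=6, window=2, seed=0):
    """Return {word: context vector} built from a flat list of tokens."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))

    # Sparse ternary index vectors: a few randomly chosen positions set to +1 or -1.
    index = {}
    for word in vocab:
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzeros, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        index[word] = vec

    # Context vectors: add the index vectors of all neighbours within the window.
    context = {word: np.zeros(dim) for word in vocab}
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context[word] += index[tokens[j]]
    return context

# Hypothetical token stream: "cat" and "dog" occur in identical 1+1 contexts.
tokens = ["x", "cat", "y", "q", "x", "dog", "y"]
ctx = random_indexing(tokens, dim=50, nonzeros=4, window=1)
```

With `window=1` (a 1+1 context) the words "cat" and "dog" receive the same context vector here, because both occur only between "x" and "y".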

12 Some applications
Automatic generation of a domain-specific thesaurus
Keyword extraction from documents
Find sets of similar documents in a collection
Find documents related to a given document or a set of terms

13 Problems and questions
How can we interpret the similarities as different kinds of relations?
How can we include document structure and phrases in the model?
Terms are not really terms, but just words
Ambiguous terms pollute the vector space
How could we find the optimal number of dimensions for the vector space?

14 An example based on 50 000 newspaper articles
stefan edberg                  bengt johansson
edberg              0.918      johansson              0.852
cincinnatis         0.887      johanssons             0.704
edbergs             0.883      bengt                  0.678
världsfemman        0.883      centerledare           0.674
stefans             0.883      miljöcentern           0.667
tennisspelarna      0.863      landsbygdscentern      0.667
stefan              0.861      implikationer          0.645
turneringsseger     0.859      ickesocialistisk       0.643
queensturneringen   0.858      centerledaren          0.627
växjöspelaren       0.852      regeringsalternativet  0.620
grästurnering       0.847      vagare                 0.616

15 Bengt Johansson is just Bengt + Johansson – something is missing!
bengt                  1.000      johansson         1.000
westerberg             0.912      johanssons        0.800
folkpartiledaren       0.899      olof              0.684
westerbergs            0.893      centerledaren     0.673
fpledaren              0.864      valperiod         0.668
socialminister         0.862      centerledarens    0.654
försvarsfrågorna       0.860      betongpolitiken   0.650
socialministern        0.841      downhill          0.640
måndagsresor           0.840      centerfamiljen    0.635
bulldozer              0.838      centerinflytande  0.634
skattesubventionerade  0.833      brokrisen         0.632
barnomsorgsgaranti     0.829      gödslet           0.628

16 A small experiment
I want the model to know the difference between Bengt and Bengt
1. Make a frequency list for all n-tuples up to n=5 with a frequency>1
2. Keep all words in the bags, but add the tuples, with space replaced by -, as words
3. Run the LSI again
Now bengt-johansson is a word, and bengt-johansson is NOT Bengt + Johansson
Number of terms grows a lot!
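Steps 1 and 2 above could be sketched like this. It is only one reading of the slide: the function names are hypothetical, tuples are taken to start at n=2 (single words are already in the bag), and frequencies are counted over the whole collection.

```python
from collections import Counter

def frequent_tuples(docs, max_n=5, min_freq=2):
    """Step 1: all word n-tuples (2 <= n <= max_n) occurring at least min_freq times."""
    counts = Counter()
    for words in docs:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {t for t, c in counts.items() if c >= min_freq}

def add_tuples(words, tuples, max_n=5):
    """Step 2: keep every word and append each frequent tuple as a hyphenated term."""
    extra = []
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            t = tuple(words[i:i + n])
            if t in tuples:
                extra.append("-".join(t))
    return list(words) + extra

# Hypothetical mini-collection of tokenized documents.
docs = [
    ["bengt", "johansson", "vann"],
    ["bengt", "johansson", "sa"],
    ["bengt", "westerberg"],
]
tuples = frequent_tuples(docs)
bags = [add_tuples(words, tuples) for words in docs]
```

The resulting bags would then be used to build a new term-document matrix for step 3.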

17 And the top list for Bengt-Johansson
bengt-johansson                          1.000
dubbellandskamperna                      0.954
pettersson-sävehof                       0.952
kristina-jönsson                         0.950
fanns-svenska-glädjeämnen                0.945
johan-pettersson-sävehof                 0.942
martinsson-karlskrona                    0.938
förbundskaptenen-bengt-bengan-johansson  0.932
förbundskaptenen-bengt-bengan            0.932
sjumålsskytt                             0.931
svenska-damhandbollslandslaget           0.928
stankiewicz                              0.926
em-par                                   0.925
västeråslaget                            0.923
jan-stankiewicz                          0.923
handbollslandslag                        0.922
bengt-johansson-tt                       0.921
st-petersburg-sverige                    0.921
petersburg-sverige                       0.921
sjuklistan                               0.920
olsson-givetvis                          0.920
emtruppen                                0.919
…
johansson                                0.567
bengt                                    0.354
olof                                     0.181
centerledaren                            0.146
westerberg                               0.061
folkpartiledaren                         0.052

18 The new vector space model
It is clear that it is now possible to find terms closely related to Bengt Johansson, the handball coach
But is the model better for single words and for document comparison as well? What do you think?
More “words” than before; hopefully this improves the result just as more data does
At least no reason for a worse result... Or?

19 An example document
REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTAMINUTEN ÖVERLÄGGNINGAR OM BRON
Under onsdagskvällen satt partiledarna i regeringen i sista minutenöverläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …

20 Closest terms in each model
Tuples added                               Basic model
0.986  underkänner                         0.967  partiledarna
0.982  irhammar                            0.921  miljökrav
0.977  partiledarna                        0.921  underkänner
0.970  godkände                            0.918  tolkar
0.962  delade-meningar                     0.897  meningar
0.960  regeringssammanträde                0.888  centerledaren
0.957  riksdagsledamot                     0.886  regeringssammanträde
0.957  bengt-westerberg                    0.880  slottet
0.954  materialet                          0.880  rosenbad
0.952  diskuterade                         0.877  planminister
0.950  folkpartiledaren                    0.866  folkpartiledaren
0.949  medierna                            0.855  thurdin
0.947  motsättningarna                     0.845  brokonsortiet
0.946  vilar                               0.839  görel
0.944  socialminister-bengt-westerberg     0.826  irhammar

21 Closest document in both models
BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR
Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …

22
Doc     Basic model      Tuples added
        Score  Rank      Score  Rank
2126    1.000     1                1
2127     .996     2       .999     2
2128     .848     5       .677     3
3767     .849     3       .534     7
211      .805     8       .526     8
156      .844     6       .525     9
215      .805     9       .522    10
2602     .848     4       .492    12
2367     .804    10       .434    19
2360     .838     7       .402    23
3481     .527    53       .673     4
1567     .456    73       .601     5
1371     .456    73       .601     5

23 Documents with better ranking in the basic model
Doc 2602: basic model .848 (rank 4), tuples added .492 (rank 12)
BRON KAN BLI VALFRÅGA SÄGER JOHANSSON
Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …
Doc 2367: basic model .804 (rank 10), tuples added .434 (rank 19)
INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET
En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …

24 Documents with better ranking in the phrase model
Doc 1567: basic model .456 (rank 73), tuples added .601 (rank 5)
ALF SVENSSON TOPPNAMN I STOCKHOLM
Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …
Doc 1371: basic model .456 (rank 74), tuples added .601 (rank 6)
BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS
Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …

25 Hmm, adding n-grams was maybe too simple...
1. If the bad result is due to overtraining, it could help to remove the words I build phrases from…
2. Another thing to try is to use a dependency parser to find more meaningful phrases, not just n-grams
A new test following 1 above:
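One possible reading of idea 1, as a sketch: every word position covered by at least one frequent tuple is dropped from the bag, and the hyphenated tuples are kept instead. The function name is hypothetical, and the slides do not say how overlapping tuples were handled; here all matching tuples are kept.

```python
def replace_words_by_tuples(words, tuples, max_n=5):
    """Drop words that are part of a frequent tuple; keep the tuples as terms."""
    covered = [False] * len(words)
    extra = []
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            t = tuple(words[i:i + n])
            if t in tuples:
                extra.append("-".join(t))
                for j in range(i, i + n):
                    covered[j] = True  # this word is now represented by a tuple
    kept = [w for w, is_covered in zip(words, covered) if not is_covered]
    return kept + extra

# Hypothetical input: one frequent tuple found in an earlier counting step.
tuples = {("bengt", "johansson")}
bag = replace_words_by_tuples(["coach", "bengt", "johansson", "won"], tuples)
```

Here `bag` contains "coach", "won" and "bengt-johansson", but no longer the single words "bengt" and "johansson".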

26 Ok, the words inside tuples are now removed
bengt-johansson                   1.000
tomas-svensson                    0.931
sveriges-handbollslandslag        0.912
förbundskapten-bengt-johansson    0.898
handboll                          0.897
svensk-handboll                   0.896
handbollsem                       0.894
carlen                            0.883
lagkaptenen-carlen                0.869
förbundskapten-johansson          0.863
ola-lindgren                      0.863
bengan-johansson                  0.862
erik-hajas                        0.854
mats-olsson                       0.854
carlen-magnus-wislander           0.852
handbollens                       0.851
magnus-andersson                  0.851
halvlek-svenskarna                0.849
teka-santander                    0.849
storskyttarna                     0.849
förbundskaptenen-bengt-johansson  0.845
målvakten-mats-olsson             0.845
danmark-tvåa                      0.843
handbollsspelare                  0.839
sveriges-handbollsherrar          0.836
lag-ibland                        0.835

27 And now pseudo documents are added for each tuple
bengt-johansson                           1.000
förbundskapten-bengt-johansson            0.907
förbundskaptenen-bengt-johansson          0.835
jonas-johansson                           0.816
förbundskapten-johansson                  0.799
johanssons                                0.795
svenske-förbundskaptenen-bengt-johansson  0.792
bengan                                    0.786
carlen                                    0.777
bengan-johansson                          0.767
johansson-andreas-dackell                 0.765
förlorat-matcherna                        0.750
ck-bure                                   0.748
daniel-johansson                          0.748
målvakten-mats-olsson                     0.747
jörgen-jönsson-mikael-johansson           0.744
kicki-johansson                           0.744
mattias-johansson-aik                     0.741
thomas-johansson                          0.739
handbollsnation                           0.738
mikael-johansson                          0.737
förbundskaptenen-bengt-johansson-valde    0.736
johansson-mats-olsson                     0.736
sveriges-handbollslandslag                0.736
ställningen-33-matcher                    0.736

28 What I still have to do something about
Find a better LSI/SVD package than the one I have (old C code from 1990), or maybe write one myself...
Get the phrases into the model in some way
When these things are done I could:
– Try to interpret various relations from similarities in a vector space model
– Try to solve the “optimal number of dimensions” problem
– Explore what the length of the vectors means

