Search (68 results, page 1 of 4)

  • theme_ss:"Retrievalalgorithmen"
  1. Kanaeva, Z.: Ranking: Google und CiteSeer (2005) 0.04
    0.035985157 = product of:
      0.10795546 = sum of:
        0.10795546 = sum of:
          0.058770303 = weight(_text_:indexing in 3276) [ClassicSimilarity], result of:
            0.058770303 = score(doc=3276,freq=2.0), product of:
              0.1985171 = queryWeight, product of:
                3.8278677 = idf(docFreq=2614, maxDocs=44218)
                0.051861014 = queryNorm
              0.29604656 = fieldWeight in 3276, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8278677 = idf(docFreq=2614, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3276)
          0.04918516 = weight(_text_:22 in 3276) [ClassicSimilarity], result of:
            0.04918516 = score(doc=3276,freq=2.0), product of:
              0.18160844 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.051861014 = queryNorm
              0.2708308 = fieldWeight in 3276, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3276)
      0.33333334 = coord(1/3)
    
    Abstract
    Within classical information retrieval, a variety of methods were developed for ranking and for searching a homogeneous, unstructured collection of documents. The success of the Google search engine has shown that searching an inhomogeneous but interlinked document collection such as the Internet can be very effective when the links between documents are taken into account. Among the concepts realised by Google is a method for ranking search results (PageRank), which is briefly explained in this article; a minimal sketch follows this entry. The article also covers the concepts behind a system called CiteSeer, which indexes bibliographic references automatically (Autonomous Citation Indexing, ACI). The latter turns a set of unconnected scientific documents into an interlinked collection and makes it possible to apply ranking methods based on those used by Google.
    Date
    20. 3.2005 16:23:22
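    The PageRank idea summarised in the abstract above can be illustrated in a few lines. The following is only a minimal sketch, not Google's implementation: the toy link graph, the damping factor of 0.85 and the convergence tolerance are illustrative assumptions.

    # Minimal PageRank power iteration over a toy link graph.
    # Illustrative only: the graph, damping factor and tolerance are assumptions,
    # not taken from the Kanaeva article.

    def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(max_iter):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, targets in links.items():
                if targets:                      # distribute rank along outgoing links
                    share = damping * rank[p] / len(targets)
                    for t in targets:
                        new_rank[t] += share
                else:                            # dangling page: spread rank evenly
                    for t in pages:
                        new_rank[t] += damping * rank[p] / n
            if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
                rank = new_rank
                break
            rank = new_rank
        return rank

    if __name__ == "__main__":
        toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
        for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
            print(page, round(score, 4))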
  2. Burgin, R.: ¬The retrieval effectiveness of 5 clustering algorithms as a function of indexing exhaustivity (1995) 0.04
    0.03594722 = product of:
      0.10784165 = sum of:
        0.10784165 = sum of:
          0.07270939 = weight(_text_:indexing in 3365) [ClassicSimilarity], result of:
            0.07270939 = score(doc=3365,freq=6.0), product of:
              0.1985171 = queryWeight, product of:
                3.8278677 = idf(docFreq=2614, maxDocs=44218)
                0.051861014 = queryNorm
              0.3662626 = fieldWeight in 3365, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.8278677 = idf(docFreq=2614, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3365)
          0.03513226 = weight(_text_:22 in 3365) [ClassicSimilarity], result of:
            0.03513226 = score(doc=3365,freq=2.0), product of:
              0.18160844 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.051861014 = queryNorm
              0.19345059 = fieldWeight in 3365, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3365)
      0.33333334 = coord(1/3)
    
    Abstract
    The retrieval effectiveness of 5 hierarchical clustering methods (single link, complete link, group average, Ward's method, and weighted average) is examined as a function of indexing exhaustivity with 4 test collections (CR, Cranfield, Medlars, and Time). Evaluations of retrieval effectiveness, based on 3 measures of optimal retrieval performance, confirm earlier findings that the performance of a retrieval system based on single link clustering varies as a function of indexing exhaustivity, but fail to find similar patterns for the other clustering methods. The data also confirm earlier findings regarding the poor performance of single link clustering in a retrieval environment, which appears to derive from that method's tendency to produce a small number of large, ill-defined document clusters (a minimal single-link sketch follows this entry). By contrast, the retrieval performance of the other clustering methods was found to be generally comparable. The data presented also provide an opportunity to examine the theoretical limits of cluster-based retrieval and to compare these limits to the effectiveness of operational implementations. Performance standards of the 4 document collections examined were found to vary widely, and the effectiveness of operational implementations was found to be in the range defined as unacceptable. Further improvements in search strategies and document representations warrant investigation.
    Date
    22. 2.1996 11:20:06
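    To make the single-link criterion discussed by Burgin concrete, here is a minimal sketch of single-link agglomerative clustering over document vectors. The toy vectors, the cosine distance and the target number of clusters are assumptions; the study itself used the CR, Cranfield, Medlars and Time collections and evaluated retrieval effectiveness rather than the clustering code itself.

    # Minimal single-link agglomerative clustering of document vectors.
    # Illustrative only: the toy vectors and the cosine distance are assumptions.

    import math

    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (na * nb) if na and nb else 1.0

    def single_link(vectors, n_clusters):
        """Merge the two clusters with the smallest minimum pairwise distance
        (the single-link criterion) until n_clusters remain."""
        clusters = [[i] for i in range(len(vectors))]
        while len(clusters) > n_clusters:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(cosine_distance(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i].extend(clusters[j])
            del clusters[j]
        return clusters

    docs = [[1, 1, 0], [1, 0.9, 0.1], [0, 0.2, 1], [0.1, 0, 1]]
    print(single_link(docs, 2))   # e.g. [[0, 1], [2, 3]]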
  3. Mandl, T.: Tolerantes Information Retrieval : Neuronale Netze zur Erhöhung der Adaptivität und Flexibilität bei der Informationssuche (2001) 0.03
    0.03469476 = product of:
      0.05204214 = sum of:
        0.043646384 = weight(_text_:systematik in 5965) [ClassicSimilarity], result of:
          0.043646384 = score(doc=5965,freq=2.0), product of:
            0.32005686 = queryWeight, product of:
              6.1714344 = idf(docFreq=250, maxDocs=44218)
              0.051861014 = queryNorm
            0.13637072 = fieldWeight in 5965, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.1714344 = idf(docFreq=250, maxDocs=44218)
              0.015625 = fieldNorm(doc=5965)
        0.0083957575 = product of:
          0.016791515 = sum of:
            0.016791515 = weight(_text_:indexing in 5965) [ClassicSimilarity], result of:
              0.016791515 = score(doc=5965,freq=2.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.08458473 = fieldWeight in 5965, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.015625 = fieldNorm(doc=5965)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Footnote
    Rev. in: nfd - Information 54(2003) H.6, S.379-380 (U. Thiel): "Did G. Salton, when developing the vector space model, know of the cybernetically oriented experiments with associative memory structures? The subject of this book reminded me of this and similar conjectures, which I discussed a few years ago with Reginald Ferber and other colleagues. At any rate, it can be said that the vector representation is an ingeniously simple rendering both of the "inverted files" used as the basic data structure in information retrieval (IR) and of the associative memory matrices which, over time, evolved via perceptrons into neural networks (NN). This formal connection subsequently stimulated a number of approaches to using networks in retrieval, and hybrid approaches that combine methods from both disciplines have proved very suitable, as they do in the present volume. But let us proceed in order... The book was submitted by the author as a dissertation to Department IV "Sprachen und Technik" of the University of Hildesheim and results from a series of research contributions to several projects in which the author took part at various locations between 1995 and 2000. This explains the unusual breadth of applications, scenarios and domains in which the results were obtained. Thus the COSIMIR model (COgnitive SIMilarity learning in Information Retrieval) developed in the thesis is evaluated not only on the classical Cranfield collection but is also applied to fact retrieval from a materials database in the WING project of the University of Regensburg. Further experiments with the component called the "transformation network", whose task is to map weighting functions between two term spaces, round off the spectrum of experiments. Not only are the presented results varied; the "state of the art" overview offered to the reader also summarises, with highly informative breadth, the essentials of the fields of IR and NN and illuminates the points where the two areas intersect. Alongside the foundations of text and fact retrieval, the approaches to improving adaptivity and to handling heterogeneity are presented, while the foundations of neural networks are covered by a general introduction to the basic concepts together with, among others, the backpropagation model, Kohonen networks and Adaptive Resonance Theory (ART). A further chapter presents the NN-oriented approaches to IR to date and completes the outline of the relevant research landscape. In preparation for the presentation of the COSIMIR model the author inserts at this point a discursive chapter on heterogeneity in IR, in which the goals and basic assumptions of the work are reflected upon once more. Object type, the quality of the objects and of their indexing, and multilinguality are named as the dimensions of heterogeneity. Even though this classification mainly emphasises problems from the projects touched upon here, rather than aiming at a comprehensive treatment of, for example, the literature on the problem of relevance, it is nevertheless helpful for understanding the design decisions behind the developed prototypes, which are often addressed only implicitly in the following chapters. The approach of handling heterogeneity through transformations is made concrete in the specific context of NN, while other possibilities, for instance employing the instruments of logic and probability theory, are discussed only briefly. A more extensive analysis would probably have stretched the scope of the work too far, since after almost 200 pages the main part of the dissertation now follows: the presentation and evaluation of the COSIMIR model already mentioned.
    The COSIMIR model "computes the similarity between the two input vectors applied to it" (p. 194). The output of the network is read from a single node, at which a so-called relevance value settles once the computation of the weights of the internal nodes has completed. These weights depend on the applied input vectors, from which the weights of the first layer of nodes are derived, and on the edge weights given in the network. The weighting of edges is the core of the neural approach: in analogy to the biological archetype (a dendrite with synapses), the weight of an edge grows with every activation during a training phase. If, in this phase, two input vectors, for example a document vector and a query, are applied together with the relevance judgement as the value of the output node, the backpropagation process distributes the weights along the paths that exist between the nodes involved (a minimal sketch of this training scheme follows this entry). Since all nodes are connected to one another, clearly different edge weights already emerge after several training examples, because the actively involved edges store the changes cumulatively. A variation of the procedure uses the NN as a "transformation network", where the two input vectors are filled with a document representation and an associated index description provided by an expert. Besides the need for training already noted, neural networks exhibit a further intrinsic problem: the more outer nodes are required, the more internal edges (and, when intermediate layers are used, nodes as well) have to be managed, and their number does not grow linearly. This algorithmic fact quickly limits naive applications of NN models in practice, and it is therefore all the more creditable that the author can propose an innovative way of solving the problem with the means of IR. He uses latent semantic indexing, which maps document representations from a high-dimensional vector space into a lower-dimensional one, in order to reduce the number of nodes considerably. The result is a very elegant synthesis, which exposes and exploits the formal correspondences between vector space models in IR and NN hinted at at the outset.
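    The training scheme described in the review (two input vectors, a single relevance output node, backpropagation) can be sketched as follows. This is only an illustration in the spirit of the COSIMIR idea, not Mandl's model: the network size, the toy training pairs and all hyperparameters are assumptions.

    # Tiny feed-forward network that takes a document vector and a query vector
    # as joint input and learns a single relevance score via backpropagation.
    # All sizes, data and hyperparameters are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    dim = 4                         # terms per vector (toy term space)
    hidden = 6
    W1 = rng.normal(scale=0.5, size=(2 * dim, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 1))

    # toy training pairs: (document vector, query vector, relevance in [0, 1])
    pairs = [
        (np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0]), 1.0),
        (np.array([0, 0, 1, 1]), np.array([1, 0, 0, 0]), 0.0),
        (np.array([0, 1, 1, 0]), np.array([0, 1, 1, 0]), 1.0),
        (np.array([1, 0, 0, 1]), np.array([0, 1, 1, 0]), 0.0),
    ]

    lr = 0.5
    for _ in range(2000):                       # plain stochastic gradient descent
        for doc, query, target in pairs:
            x = np.concatenate([doc, query]).astype(float)
            h = sigmoid(x @ W1)                 # hidden activations
            y = sigmoid(h @ W2)[0]              # single relevance output node
            # backpropagate the squared error through both layers
            delta_out = (y - target) * y * (1 - y)
            delta_hid = delta_out * W2[:, 0] * h * (1 - h)
            W2[:, 0] -= lr * delta_out * h
            W1 -= lr * np.outer(x, delta_hid)

    doc, query = np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])
    print(sigmoid(sigmoid(np.concatenate([doc, query]) @ W1) @ W2))  # should be close to 1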
  4. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.03
    0.03084442 = product of:
      0.09253326 = sum of:
        0.09253326 = sum of:
          0.05037455 = weight(_text_:indexing in 6973) [ClassicSimilarity], result of:
            0.05037455 = score(doc=6973,freq=2.0), product of:
              0.1985171 = queryWeight, product of:
                3.8278677 = idf(docFreq=2614, maxDocs=44218)
                0.051861014 = queryNorm
              0.2537542 = fieldWeight in 6973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8278677 = idf(docFreq=2614, maxDocs=44218)
                0.046875 = fieldNorm(doc=6973)
          0.042158708 = weight(_text_:22 in 6973) [ClassicSimilarity], result of:
            0.042158708 = score(doc=6973,freq=2.0), product of:
              0.18160844 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.051861014 = queryNorm
              0.23214069 = fieldWeight in 6973, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=6973)
      0.33333334 = coord(1/3)
    
    Abstract
    Proposes that signature files be used as a viable alternative to other indexing strategies, such as inverted files, for searching through large volumes of text (a minimal superimposed-coding sketch follows this entry). Demonstrates through simulation that search times can be further reduced by enhancing the basic signature file concept using deterministic partitioning algorithms which eliminate the need for an exhaustive search of the entire signature file. Reports research to evaluate the performance of some deterministic partitioning algorithms in a non-simulated environment using 276 MB of raw newspaper text (taken from the Wall Street Journal) and real user queries. Presents a selection of results to illustrate trends and highlight important aspects of the performance of these methods under realistic rather than simulated operating conditions. As a result of the research reported here, certain aspects of this approach to signature files are found wanting and require improvement. Suggests lines of future research on the partitioning of signature files.
    Source
    Information retrieval: new systems and current research. Proceedings of the 16th Research Colloquium of the British Computer Society Information Retrieval Specialist Group, Drymen, Scotland, 22-23 Mar 94. Ed.: R. Leon
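    The signature file technique that Kelledy and Smeaton start from can be sketched with superimposed coding: each word sets a few bits in a fixed-width bit string, a document signature is the OR of its word signatures, and a query signature is tested with a bitwise AND, followed by a text check to weed out false drops. The signature width, the number of bits per word and the sample documents below are assumptions; the partitioning schemes evaluated in the paper are not reproduced here.

    # Minimal superimposed-coding signature file (illustrative assumptions only).

    import hashlib

    WIDTH = 64         # bits per signature
    BITS_PER_WORD = 3  # bits set for each word

    def word_signature(word):
        sig = 0
        for i in range(BITS_PER_WORD):
            digest = hashlib.sha1(f"{word}:{i}".encode()).digest()
            sig |= 1 << (int.from_bytes(digest[:4], "big") % WIDTH)
        return sig

    def doc_signature(text):
        sig = 0
        for word in text.lower().split():
            sig |= word_signature(word)        # superimpose (bitwise OR) word codes
        return sig

    docs = {
        1: "signature files for text retrieval",
        2: "inverted files and b trees",
        3: "partitioning signature files reduces search time",
    }
    signatures = {doc_id: doc_signature(text) for doc_id, text in docs.items()}

    def search(query):
        qsig = doc_signature(query)
        # a document qualifies if every query bit is set; matches may still be
        # false drops, so the raw text is checked afterwards
        candidates = [d for d, s in signatures.items() if s & qsig == qsig]
        return [d for d in candidates if all(w in docs[d].lower().split()
                                             for w in query.lower().split())]

    print(search("signature files"))   # -> [1, 3]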
  5. Chang, R.: Keyword searching and indexing (1993) 0.03
    0.025031313 = product of:
      0.07509394 = sum of:
        0.07509394 = product of:
          0.15018788 = sum of:
            0.15018788 = weight(_text_:indexing in 7223) [ClassicSimilarity], result of:
              0.15018788 = score(doc=7223,freq=10.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.7565488 = fieldWeight in 7223, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=7223)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Explains how a computer indexing system works. Reviews fundamentals of how data are stored and retrieved by computers. Describes B-Tree and B+-Tree indexing structures. Gives basic keyword searching techniques that the user must apply to make use of the indexing programs. The demand for keyword retrieval is increasing, and librarians should expect to see the keyword-indexing feature become commonly available (a minimal keyword-lookup sketch follows this entry).
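    As a stand-in for the kind of keyword index described above, the sketch below keeps a sorted term dictionary (playing the role of the leaf level of a B+-Tree) that maps each keyword to the records containing it; a lookup is then a binary search. The sample records are assumptions, and a real B+-Tree would of course add internal nodes and disk-block management.

    # Sorted term dictionary with binary-search lookup (illustrative only).

    from bisect import bisect_left
    from collections import defaultdict

    records = {
        10: "keyword searching and indexing",
        20: "how data are stored and retrieved by computers",
        30: "b tree and b plus tree indexing structures",
    }

    # build the postings, then keep the term dictionary sorted so that a lookup
    # is a binary search, as it would be at the leaf level of a B+-Tree
    postings = defaultdict(set)
    for rec_id, text in records.items():
        for term in text.split():
            postings[term].add(rec_id)
    terms = sorted(postings)

    def lookup(term):
        i = bisect_left(terms, term)
        if i < len(terms) and terms[i] == term:
            return sorted(postings[term])
        return []

    print(lookup("indexing"))   # -> [10, 30]
    print(lookup("virtual"))    # -> []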
  6. Abdelkareem, M.A.A.: In terms of publication index, what indicator is the best for researchers indexing, Google Scholar, Scopus, Clarivate or others? (2018) 0.02
    0.021902401 = product of:
      0.0657072 = sum of:
        0.0657072 = product of:
          0.1314144 = sum of:
            0.1314144 = weight(_text_:indexing in 4548) [ClassicSimilarity], result of:
              0.1314144 = score(doc=4548,freq=10.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.6619802 = fieldWeight in 4548, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4548)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    I believe that Google Scholar is the most popular academic indexing service for researchers and citations. However, some other indexing institutions may be more professional than Google Scholar, though not as popular. Other indexing websites, such as Scopus and Clarivate, provide more statistical figures for scholars, institutions or even journals. In terms of publication citations, Google Scholar always shows higher citation counts for a paper than other indexing websites, since Google Scholar considers most publication platforms and can therefore count citations easily, while other databases only count citations coming from journals that are already indexed in their database.
  7. Voorhees, E.M.: Implementing agglomerative hierarchic clustering algorithms for use in document retrieval (1986) 0.02
    0.018737204 = product of:
      0.056211613 = sum of:
        0.056211613 = product of:
          0.112423226 = sum of:
            0.112423226 = weight(_text_:22 in 402) [ClassicSimilarity], result of:
              0.112423226 = score(doc=402,freq=2.0), product of:
                0.18160844 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.051861014 = queryNorm
                0.61904186 = fieldWeight in 402, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.125 = fieldNorm(doc=402)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Source
    Information processing and management. 22(1986) no.6, S.465-476
  8. MacFarlane, A.; McCann, J.A.; Robertson, S.E.: Parallel methods for the generation of partitioned inverted files (2005) 0.02
    0.016791517 = product of:
      0.05037455 = sum of:
        0.05037455 = product of:
          0.1007491 = sum of:
            0.1007491 = weight(_text_:indexing in 651) [ClassicSimilarity], result of:
              0.1007491 = score(doc=651,freq=8.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.5075084 = fieldWeight in 651, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.046875 = fieldNorm(doc=651)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Purpose - The generation of inverted indexes is one of the most computationally intensive activities for information retrieval systems: indexing large multi-gigabyte text databases can take many hours or even days to complete. We examine the generation of partitioned inverted files in order to speed up the process of indexing. Two types of index partitions are investigated: TermId and DocId (a small sketch of the two schemes follows this entry). Design/methodology/approach - We use standard parallel computing measures such as speedup and efficiency to examine the computing results and also the space costs of our trial indexing experiments. Findings - The results from runs on both partitioning methods are compared and contrasted, concluding that DocId is the more efficient method. Practical implications - The practical implications are that the DocId partitioning method would in most circumstances be used for distributing inverted file data in a parallel computer, particularly if indexing speed is the primary consideration. Originality/value - The paper is of value to database administrators who manage large-scale text collections, and who need to use parallel computing to implement their text retrieval services.
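    The difference between the two partitioning schemes compared by MacFarlane, McCann and Robertson can be sketched as follows: under a DocId partition each worker indexes a disjoint slice of the documents, while under a TermId partition each worker holds the complete postings for a disjoint set of terms, so a query term touches every DocId partition but only one TermId partition. The toy documents and the two-way split below are assumptions, and no actual parallelism is exercised.

    # DocId vs. TermId partitioning of a toy inverted index (illustrative only).

    from collections import defaultdict

    docs = {
        0: "parallel methods for inverted files",
        1: "partitioned inverted files speed up indexing",
        2: "termid and docid partitions compared",
        3: "docid partitioning suits distributed retrieval",
    }

    def build_index(doc_subset):
        index = defaultdict(list)
        for doc_id in sorted(doc_subset):
            for term in docs[doc_id].split():
                index[term].append(doc_id)
        return dict(index)

    # DocId partitioning: split the collection, each partition indexes its own docs
    docid_partitions = [build_index({0, 1}), build_index({2, 3})]

    # TermId partitioning: build one global index, then assign whole postings
    # lists to partitions by term (here simply by hashing the term)
    global_index = build_index(docs)
    termid_partitions = [{}, {}]
    for term, postings in global_index.items():
        termid_partitions[hash(term) % 2][term] = postings

    # a query term touches every DocId partition, but only one TermId partition
    term = "inverted"
    print([p.get(term, []) for p in docid_partitions])
    print(termid_partitions[hash(term) % 2].get(term, []))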
  9. Smeaton, A.F.; Rijsbergen, C.J. van: ¬The retrieval effects of query expansion on a feedback document retrieval system (1983) 0.02
    0.016395055 = product of:
      0.04918516 = sum of:
        0.04918516 = product of:
          0.09837032 = sum of:
            0.09837032 = weight(_text_:22 in 2134) [ClassicSimilarity], result of:
              0.09837032 = score(doc=2134,freq=2.0), product of:
                0.18160844 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.051861014 = queryNorm
                0.5416616 = fieldWeight in 2134, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=2134)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    30. 3.2001 13:32:22
  10. Back, J.: ¬An evaluation of relevancy ranking techniques used by Internet search engines (2000) 0.02
    0.016395055 = product of:
      0.04918516 = sum of:
        0.04918516 = product of:
          0.09837032 = sum of:
            0.09837032 = weight(_text_:22 in 3445) [ClassicSimilarity], result of:
              0.09837032 = score(doc=3445,freq=2.0), product of:
                0.18160844 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.051861014 = queryNorm
                0.5416616 = fieldWeight in 3445, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3445)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    25. 8.2005 17:42:22
  11. Chang, R.: ¬The development of indexing technology (1993) 0.02
    0.015831191 = product of:
      0.047493573 = sum of:
        0.047493573 = product of:
          0.09498715 = sum of:
            0.09498715 = weight(_text_:indexing in 7024) [ClassicSimilarity], result of:
              0.09498715 = score(doc=7024,freq=4.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.47848347 = fieldWeight in 7024, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=7024)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Reviews the basic techniques of computerized indexing, including various file access methods such as the Sequential Access Method (SAM), Direct Access Method (DAM), Indexed Sequential Access Method (ISAM), and Virtual Indexed Sequential Access Method (VSAM), as well as various B-tree (balanced tree) structures. Illustrates how records are stored and accessed, and how B-trees are used to improve the operations of information retrieval and maintenance (a minimal ISAM-style lookup sketch follows this entry).
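    The indexed sequential idea mentioned in the review can be sketched briefly: records are kept in key order in fixed-size blocks, and a sparse index of first keys routes a lookup to the single block that has to be scanned. The block size and sample records are assumptions; real implementations add overflow handling and, in B-trees, multiple index levels.

    # ISAM-style lookup: sparse index over sorted, blocked records (illustrative only).

    from bisect import bisect_right

    records = sorted([(k, f"record-{k}") for k in (3, 7, 12, 19, 25, 31, 44, 58)])
    BLOCK_SIZE = 3
    blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]
    sparse_index = [block[0][0] for block in blocks]   # first key of every block

    def isam_lookup(key):
        # binary search the sparse index, then scan sequentially inside the block
        block_no = max(bisect_right(sparse_index, key) - 1, 0)
        for k, value in blocks[block_no]:
            if k == key:
                return value
        return None

    print(isam_lookup(19))   # -> 'record-19'
    print(isam_lookup(20))   # -> None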
  12. Frakes, W.B.: Stemming algorithms (1992) 0.02
    0.015831191 = product of:
      0.047493573 = sum of:
        0.047493573 = product of:
          0.09498715 = sum of:
            0.09498715 = weight(_text_:indexing in 3503) [ClassicSimilarity], result of:
              0.09498715 = score(doc=3503,freq=4.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.47848347 = fieldWeight in 3503, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=3503)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Describes stemming algorithms - programs that relate morphologically similar indexing and search terms. Stemming is used to improve retrieval effectiveness and to reduce the size of indexing files. Several approaches to stemming are described - table lookup, affix removal, successor variety, and n-gram - and empirical studies of stemming are summarized. The Porter stemmer is described in detail, and a full implementation in C is presented (a simplified affix-removal sketch follows this entry).
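    A deliberately simplified affix-removal stemmer in the spirit of the survey above. It is not the Porter algorithm: the suffix list and the minimum stem length are assumptions, and the point is only that morphologically similar indexing and search terms are conflated by stripping suffixes.

    # Toy suffix-stripping stemmer (illustrative assumptions only).

    SUFFIXES = ["ational", "ization", "fulness", "ingly", "ments", "ions",
                "ies", "ers", "ing", "ion", "es", "ed", "er", "ly", "s"]

    def stem(word, min_stem=3):
        word = word.lower()
        for suffix in SUFFIXES:                   # longest suffixes are tried first
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                return word[: -len(suffix)]
        return word

    terms = ["indexing", "indexed", "indexes", "retrieval", "stemming", "algorithms"]
    print({t: stem(t) for t in terms})
    # indexing/indexed/indexes all conflate to 'index'; unknown suffixes are left alone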
  13. Maron, M.E.: ¬An historical note on the origins of probabilistic indexing (2008) 0.02
    0.015831191 = product of:
      0.047493573 = sum of:
        0.047493573 = product of:
          0.09498715 = sum of:
            0.09498715 = weight(_text_:indexing in 2047) [ClassicSimilarity], result of:
              0.09498715 = score(doc=2047,freq=4.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.47848347 = fieldWeight in 2047, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=2047)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    The motivation behind "Probabilistic Indexing" was to replace two-valued thinking about information retrieval with probabilistic notions. This involved a new view of the information retrieval problem - viewing it as a problem of inference and prediction, and introducing probabilistically weighted indexes and probabilistically ranked output. These ideas were first formulated and written up in August 1958.
  14. Thompson, P.: Looking back: on relevance, probabilistic indexing and information retrieval (2008) 0.02
    0.015831191 = product of:
      0.047493573 = sum of:
        0.047493573 = product of:
          0.09498715 = sum of:
            0.09498715 = weight(_text_:indexing in 2074) [ClassicSimilarity], result of:
              0.09498715 = score(doc=2074,freq=4.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.47848347 = fieldWeight in 2074, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=2074)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Forty-eight years ago Maron and Kuhns published their paper, "On Relevance, Probabilistic Indexing and Information Retrieval" (1960). This was the first paper to present a probabilistic approach to information retrieval, and perhaps the first paper on ranked retrieval. Although it is one of the most widely cited papers in the field of information retrieval, many researchers today may not be familiar with its influence. This paper describes the Maron and Kuhns article and the influence that it has had on the field of information retrieval.
  15. Efron, M.: Query expansion and dimensionality reduction : Notions of optimality in Rocchio relevance feedback and latent semantic indexing (2008) 0.01
    0.014541878 = product of:
      0.043625634 = sum of:
        0.043625634 = product of:
          0.08725127 = sum of:
            0.08725127 = weight(_text_:indexing in 2020) [ClassicSimilarity], result of:
              0.08725127 = score(doc=2020,freq=6.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.4395151 = fieldWeight in 2020, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2020)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method's basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI's and Rocchio's notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem (a minimal Rocchio update sketch follows this entry). On the other hand, LSI's motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.
    Object
    Latent semantic indexing
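    A minimal Rocchio relevance-feedback update, the first of the two least-squares-motivated methods the abstract compares. The toy vectors and the (alpha, beta, gamma) weights are conventional illustrative values, not parameters taken from Efron's analysis.

    # Rocchio query update: move the query toward relevant documents and away
    # from nonrelevant ones (illustrative values only).

    import numpy as np

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        q = alpha * np.asarray(query, dtype=float)
        if len(relevant):
            q += beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
        if len(nonrelevant):
            q -= gamma * np.mean(np.asarray(nonrelevant, dtype=float), axis=0)
        return np.clip(q, 0.0, None)     # negative term weights are usually dropped

    query = [1.0, 0.0, 0.0, 0.0]
    relevant = [[0.9, 0.8, 0.0, 0.0], [1.0, 0.6, 0.1, 0.0]]
    nonrelevant = [[0.0, 0.0, 1.0, 0.9]]
    print(rocchio(query, relevant, nonrelevant))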
  16. Deerwester, S.; Dumais, S.; Landauer, T.; Furnass, G.; Beck, L.: Improving information retrieval with latent semantic indexing (1988) 0.01
    0.014541878 = product of:
      0.043625634 = sum of:
        0.043625634 = product of:
          0.08725127 = sum of:
            0.08725127 = weight(_text_:indexing in 2396) [ClassicSimilarity], result of:
              0.08725127 = score(doc=2396,freq=6.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.4395151 = fieldWeight in 2396, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2396)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Describes a latent semantic indexing (LSI) approach for improving information retrieval. Most document retrieval systems depend on matching keywords in queries against those in documents. The LSI approach tries to overcome the incompleteness and imprecision of keyword matching by exploiting latent relations among terms and documents. Tested performance of the LSI method ranged from considerably better than to roughly comparable to performance based on weighted keyword matching, apparently depending on the quality of the queries. Best LSI performance was found using a global entropy weighting for terms and about 100 dimensions for representing terms, documents and queries.
    Object
    Latent Semantic Indexing
  17. Deerwester, S.C.; Dumais, S.T.; Landauer, T.K.; Furnas, G.W.; Harshman, R.A.: Indexing by latent semantic analysis (1990) 0.01
    0.014541878 = product of:
      0.043625634 = sum of:
        0.043625634 = product of:
          0.08725127 = sum of:
            0.08725127 = weight(_text_:indexing in 2399) [ClassicSimilarity], result of:
              0.08725127 = score(doc=2399,freq=6.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.4395151 = fieldWeight in 2399, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2399)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising (a small worked sketch of the procedure follows this entry).
    Object
    Latent Semantic Indexing
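    The recipe in the abstract above can be followed on a toy scale: decompose the term-document matrix with a singular-value decomposition, keep k factors, fold the query in as a pseudo-document and rank documents by cosine similarity in the reduced space. The tiny matrix and k = 2 are assumptions (the paper used roughly 100 factors), and the fold-in formula shown is one common variant.

    # Toy latent semantic indexing via truncated SVD (illustrative only).

    import numpy as np

    # rows = terms, columns = documents (raw term frequencies)
    A = np.array([
        [1, 1, 0, 0],   # "semantic"
        [1, 0, 0, 0],   # "indexing"
        [0, 1, 1, 0],   # "retrieval"
        [0, 0, 1, 1],   # "clustering"
        [0, 0, 0, 1],   # "stemming"
    ], dtype=float)

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    Sk_inv = np.diag(1.0 / s[:k])
    doc_vecs = Vt[:k].T                # rows: documents in the k-factor space

    def fold_in(query_term_vector):
        """Project a query (a vector over the same terms) into the factor space."""
        return np.asarray(query_term_vector, dtype=float) @ Uk @ Sk_inv

    def rank(query_term_vector):
        q = fold_in(query_term_vector)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        return np.argsort(-sims)       # document indices, best first

    print(rank([0, 1, 0, 0, 0]))       # query containing only the term "indexing"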
  18. Zhang, W.; Yoshida, T.; Tang, X.: ¬A comparative study of TF*IDF, LSI and multi-words for text classification (2011) 0.01
    0.014541878 = product of:
      0.043625634 = sum of:
        0.043625634 = product of:
          0.08725127 = sum of:
            0.08725127 = weight(_text_:indexing in 1165) [ClassicSimilarity], result of:
              0.08725127 = score(doc=1165,freq=6.0), product of:
                0.1985171 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.051861014 = queryNorm
                0.4395151 = fieldWeight in 1165, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1165)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    One of the main themes in text mining is text representation, which is fundamental and indispensable for text-based intelligent information processing. Generally, text representation includes two tasks: indexing and weighting (a minimal TF*IDF weighting sketch follows this entry). This paper has comparatively studied TF*IDF, LSI and multi-words for text representation. We used a Chinese and an English document collection to evaluate the three methods in information retrieval and text categorization, respectively. Experimental results have demonstrated that in text categorization, LSI has better performance than the other methods in both document collections. Also, LSI has produced the best performance in retrieving English documents. This outcome shows that LSI has both favorable semantic and statistical quality, and it contradicts the claim that LSI cannot produce discriminative power for indexing.
    Object
    Latent Semantic Indexing
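    A minimal TF*IDF weighting sketch, the first of the three representation schemes compared above. The toy documents and the particular tf and idf variants (raw term frequency, log N/df) are illustrative assumptions; LSI and multi-word representations are not reproduced here.

    # TF*IDF weights for a toy collection (illustrative only).

    import math
    from collections import Counter

    docs = [
        "text representation for text classification",
        "latent semantic indexing for text retrieval",
        "multi word features for classification",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))

    def tf_idf(toks):
        tf = Counter(toks)
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    for toks in tokenized:
        weights = tf_idf(toks)
        print({t: round(v, 3) for t, v in weights.items() if v > 0})
    # terms occurring in every document (e.g. 'for') get idf = log(1) = 0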
  19. Fuhr, N.: Ranking-Experimente mit gewichteter Indexierung (1986) 0.01
    0.014052903 = product of:
      0.042158708 = sum of:
        0.042158708 = product of:
          0.084317416 = sum of:
            0.084317416 = weight(_text_:22 in 58) [ClassicSimilarity], result of:
              0.084317416 = score(doc=58,freq=2.0), product of:
                0.18160844 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.051861014 = queryNorm
                0.46428138 = fieldWeight in 58, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=58)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    14. 6.2015 22:12:44
  20. Fuhr, N.: Rankingexperimente mit gewichteter Indexierung (1986) 0.01
    0.014052903 = product of:
      0.042158708 = sum of:
        0.042158708 = product of:
          0.084317416 = sum of:
            0.084317416 = weight(_text_:22 in 2051) [ClassicSimilarity], result of:
              0.084317416 = score(doc=2051,freq=2.0), product of:
                0.18160844 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.051861014 = queryNorm
                0.46428138 = fieldWeight in 2051, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=2051)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    14. 6.2015 22:12:56

Languages

  • e 62
  • d 5
  • chi 1

Types

  • a 56
  • m 8
  • s 3
  • el 2
  • r 1