Search (68 results, page 1 of 4)

Kanaeva, Z.: Ranking: Google und CiteSeer (2005) 0.05

0.053932853 = product of:
  0.107865706 = sum of:
    0.107865706 = sum of:
      0.05872144 = weight(_text_:indexing in 3276) [ClassicSimilarity], result of:
        0.05872144 = score(doc=3276,freq=2.0), product of:
          0.19835205 = queryWeight, product of:
            3.8278677 = idf(docFreq=2614, maxDocs=44218)
            0.051817898 = queryNorm
          0.29604656 = fieldWeight in 3276, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.8278677 = idf(docFreq=2614, maxDocs=44218)
            0.0546875 = fieldNorm(doc=3276)
      0.049144268 = weight(_text_:22 in 3276) [ClassicSimilarity], result of:
        0.049144268 = score(doc=3276,freq=2.0), product of:
          0.18145745 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.051817898 = queryNorm
          0.2708308 = fieldWeight in 3276, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0546875 = fieldNorm(doc=3276)
  0.5 = coord(1/2)

Abstract: Classical information retrieval has produced a range of methods for ranking and searching in a homogeneous, unstructured document collection. The success of the Google search engine has shown that searching an inhomogeneous but interconnected document collection such as the Internet can be very effective when the connections between documents (links) are taken into account. Among the concepts implemented by Google is a method for ranking search results (PageRank), which this article explains briefly. The article also discusses the concepts behind a system called CiteSeer, which automatically indexes bibliographic references (Autonomous Citation Indexing, ACI). CiteSeer turns a set of unlinked scholarly documents into an interconnected document collection and thereby permits the use of ranking methods based on those used by Google.
Date: 20. 3.2005 16:23:22
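
The score breakdown for this entry can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the search engine's own code: the function names `classic_idf` and `term_score` are illustrative, and the idf formula `1 + ln(maxDocs / (docFreq + 1))` is an assumption that happens to match the idf values printed in the listing. All numeric inputs (freq, docFreq, maxDocs, queryNorm, fieldNorm, coord) are copied from the breakdown above.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.051817898  # queryNorm, as printed in the listing


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assumed to be 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, field_norm: float) -> float:
    """Score of one term: queryWeight * fieldWeight, as in the breakdown."""
    idf = classic_idf(doc_freq, MAX_DOCS)
    query_weight = idf * QUERY_NORM                      # idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm    # tf * idf * fieldNorm
    return query_weight * field_weight


# Document 3276 matches both query terms, each with freq=2.0.
indexing = term_score(freq=2.0, doc_freq=2614, field_norm=0.0546875)
twenty_two = term_score(freq=2.0, doc_freq=3622, field_norm=0.0546875)

# The outer product applies coord(1/2) = 0.5 to the sum of the term scores.
total = 0.5 * (indexing + twenty_two)
print(f"indexing={indexing:.6f} 22={twenty_two:.6f} total={total:.6f}")
```

The results agree with the listed values to about six decimal places; small differences in later digits are to be expected, since the listing appears to use single-precision arithmetic.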

Burgin, R.: ¬The retrieval effectiveness of 5 clustering algorithms as a function of indexing exhaustivity (1995) 0.05
```
0.053875998 = product of:
  0.107751995 = sum of:
    0.107751995 = sum of:
      0.07264894 = weight(_text_:indexing in 3365) [ClassicSimilarity], result of:
        0.07264894 = score(doc=3365,freq=6.0), product of:
          0.19835205 = queryWeight, product of:
            3.8278677 = idf(docFreq=2614, maxDocs=44218)
            0.051817898 = queryNorm
          0.3662626 = fieldWeight in 3365, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            3.8278677 = idf(docFreq=2614, maxDocs=44218)
            0.0390625 = fieldNorm(doc=3365)
      0.03510305 = weight(_text_:22 in 3365) [ClassicSimilarity], result of:
        0.03510305 = score(doc=3365,freq=2.0), product of:
          0.18145745 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.051817898 = queryNorm
          0.19345059 = fieldWeight in 3365, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0390625 = fieldNorm(doc=3365)
  0.5 = coord(1/2)
```
Abstract

The retrieval effectiveness of 5 hierarchical clustering methods (single link, complete link, group average, Ward's method, and weighted average) is examined as a function of indexing exhaustivity with 4 test collections (CR, Cranfield, Medlars, and Time). Evaluations of retrieval effectiveness, based on 3 measures of optimal retrieval performance, confirm earlier findings that the performance of a retrieval system based on single link clustering varies as a function of indexing exhaustivity but fail to find similar patterns for other clustering methods. The data also confirm earlier findings regarding the poor performance of single link clustering in a retrieval environment. The poor performance of single link clustering appears to derive from that method's tendency to produce a small number of large, ill-defined document clusters. By contrast, the data examined here found the retrieval performance of the other clustering methods to be generally comparable. The data presented also provide an opportunity to examine the theoretical limits of cluster-based retrieval and to compare these theoretical limits to the effectiveness of operational implementations. Performance standards of the 4 document collections examined were found to vary widely, and the effectiveness of operational implementations was found to be in the range defined as unacceptable. Further improvements in search strategies and document representations warrant investigation.

Date

22. 2.1996 11:20:06

Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.05
```
0.046228163 = product of:
  0.092456326 = sum of:
    0.092456326 = sum of:
      0.050332665 = weight(_text_:indexing in 6973) [ClassicSimilarity], result of:
        0.050332665 = score(doc=6973,freq=2.0), product of:
          0.19835205 = queryWeight, product of:
            3.8278677 = idf(docFreq=2614, maxDocs=44218)
            0.051817898 = queryNorm
          0.2537542 = fieldWeight in 6973, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.8278677 = idf(docFreq=2614, maxDocs=44218)
            0.046875 = fieldNorm(doc=6973)
      0.042123657 = weight(_text_:22 in 6973) [ClassicSimilarity], result of:
        0.042123657 = score(doc=6973,freq=2.0), product of:
          0.18145745 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.051817898 = queryNorm
          0.23214069 = fieldWeight in 6973, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.046875 = fieldNorm(doc=6973)
  0.5 = coord(1/2)
```
Abstract

Proposes that signature files be used as a viable alternative to other indexing strategies, such as inverted files, for searching through large volumes of text. Demonstrates through simulation that search times can be further reduced by enhancing the basic signature file concept with deterministic partitioning algorithms, which eliminate the need for an exhaustive search of the entire signature file. Reports research to evaluate the performance of some deterministic partitioning algorithms in a non-simulated environment using 276 MB of raw newspaper text (taken from the Wall Street Journal) and real user queries. Presents a selection of results to illustrate trends and highlight important aspects of the performance of these methods under realistic rather than simulated operating conditions. The research shows that certain aspects of this approach to signature files are wanting and require improvement. Suggests lines of future research on the partitioning of signature files.

Source

Information retrieval: new systems and current research. Proceedings of the 16th Research Colloquium of the British Computer Society Information Retrieval Specialist Group, Drymen, Scotland, 22-23 Mar 94. Ed.: R. Leon

Chang, R.: Keyword searching and indexing (1993) 0.04

0.037515752 = product of:
  0.075031504 = sum of:
    0.075031504 = product of:
      0.15006301 = sum of:
        0.15006301 = weight(_text_:indexing in 7223) [ClassicSimilarity], result of:
          0.15006301 = score(doc=7223,freq=10.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.7565488 = fieldWeight in 7223, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0625 = fieldNorm(doc=7223)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Abstract: Explains how a computer indexing system works. Reviews fundamentals of how data are stored and retrieved by computers. Describes B-Tree and B+-Tree indexing structures. Gives basic keyword searching techniques that the user must apply to make use of the indexing programs. The demand for keyword retrieval is increasing and librarians should expect to see the keyword-indexing feature become commonly available
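
The breakdown for this entry contains two nested coord(1/2) factors, which is why a document with ten occurrences of "indexing" still ranks below documents that match both query terms only twice each. The sketch below illustrates that effect under stated assumptions: `term_score` is an illustrative helper, the idf formula is inferred from the printed idf values, and every numeric input is copied from the listing.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.051817898  # queryNorm, as printed in the listing


def term_score(freq: float, doc_freq: int, field_norm: float) -> float:
    """queryWeight * fieldWeight, with idf assumed to be 1 + ln(maxDocs / (docFreq + 1))."""
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    return (idf * QUERY_NORM) * (math.sqrt(freq) * idf * field_norm)


# Chang (1993), doc 7223: only "indexing" matches (freq=10.0),
# so the term score is scaled by an inner and an outer coord(1/2).
chang_raw = term_score(freq=10.0, doc_freq=2614, field_norm=0.0625)
chang = chang_raw * 0.5 * 0.5

# Kanaeva (2005), doc 3276: both terms match (freq=2.0 each),
# so only the outer coord(1/2) applies to the sum.
kanaeva = 0.5 * (
    term_score(freq=2.0, doc_freq=2614, field_norm=0.0546875)
    + term_score(freq=2.0, doc_freq=3622, field_norm=0.0546875)
)

print(f"raw={chang_raw:.6f} chang={chang:.6f} kanaeva={kanaeva:.6f}")
print("single-term match ranks lower:", chang < kanaeva)
```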

Abdelkareem, M.A.A.: In terms of publication index, what indicator is the best for researchers indexing, Google Scholar, Scopus, Clarivate or others? (2018) 0.03
```
0.032826282 = product of:
  0.065652564 = sum of:
    0.065652564 = product of:
      0.13130513 = sum of:
        0.13130513 = weight(_text_:indexing in 4548) [ClassicSimilarity], result of:
          0.13130513 = score(doc=4548,freq=10.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.6619802 = fieldWeight in 4548, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4548)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

I believe that Google Scholar is the most popular academic indexing service for researchers and citations. Some other indexing institutions may be more professional than Google Scholar but are not as popular. Indexing websites such as Scopus and Clarivate provide more statistical figures for scholars, institutions and even journals. In terms of citation counts, Google Scholar always shows more citations for a paper than other indexing websites, because it considers most publication platforms and can therefore count the citations easily, whereas other databases consider only citations from journals that are already indexed in their database.

Voorhees, E.M.: Implementing agglomerative hierarchic clustering algorithms for use in document retrieval (1986) 0.03

0.028082438 = product of:
  0.056164876 = sum of:
    0.056164876 = product of:
      0.11232975 = sum of:
        0.11232975 = weight(_text_:22 in 402) [ClassicSimilarity], result of:
          0.11232975 = score(doc=402,freq=2.0), product of:
            0.18145745 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.051817898 = queryNorm
            0.61904186 = fieldWeight in 402, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.125 = fieldNorm(doc=402)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Source: Information processing and management. 22(1986) no.6, S.465-476

MacFarlane, A.; McCann, J.A.; Robertson, S.E.: Parallel methods for the generation of partitioned inverted files (2005) 0.03
```
0.025166333 = product of:
  0.050332665 = sum of:
    0.050332665 = product of:
      0.10066533 = sum of:
        0.10066533 = weight(_text_:indexing in 651) [ClassicSimilarity], result of:
          0.10066533 = score(doc=651,freq=8.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.5075084 = fieldWeight in 651, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.046875 = fieldNorm(doc=651)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Purpose - The generation of inverted indexes is one of the most computationally intensive activities for information retrieval systems: indexing large multi-gigabyte text databases can take many hours or even days to complete. We examine the generation of partitioned inverted files in order to speed up the process of indexing. Two types of index partitions are investigated: TermId and DocId. Design/methodology/approach - We use standard measures used in parallel computing such as speedup and efficiency to examine the computing results and also the space costs of our trial indexing experiments. Findings - The results from runs on both partitioning methods are compared and contrasted, concluding that DocId is the more efficient method. Practical implications - The practical implications are that the DocId partitioning method would in most circumstances be used for distributing inverted file data in a parallel computer, particularly if indexing speed is the primary consideration. Originality/value - The paper is of value to database administrators who manage large-scale text collections, and who need to use parallel computing to implement their text retrieval services.
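
The two partitioning schemes and the two parallel-computing measures named in this abstract can be illustrated with a toy example. This is a minimal sketch under assumptions of my own: the round-robin assignment, the whitespace tokenizer, the sample documents and all function names are hypothetical and are not taken from the paper. DocId partitioning gives each partition a complete index over a subset of documents; TermId partitioning gives each partition the complete postings lists for a subset of terms.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            index[term].append(doc_id)
    return dict(index)


def partition_by_doc_id(docs: dict[int, str], n_parts: int) -> list[dict[str, list[int]]]:
    """DocId partitioning: split the documents, then index each subset separately."""
    subsets = [{} for _ in range(n_parts)]
    for doc_id, text in docs.items():
        subsets[doc_id % n_parts][doc_id] = text  # round-robin, an assumption
    return [build_inverted_index(subset) for subset in subsets]


def partition_by_term_id(docs: dict[int, str], n_parts: int) -> list[dict[str, list[int]]]:
    """TermId partitioning: build one index, then split its vocabulary."""
    full = build_inverted_index(docs)
    parts = [{} for _ in range(n_parts)]
    for i, term in enumerate(sorted(full)):
        parts[i % n_parts][term] = full[term]
    return parts


def speedup(t_serial: float, t_parallel: float) -> float:
    """Standard speedup: serial time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_processors: int) -> float:
    """Standard efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / n_processors


docs = {
    0: "parallel inverted file",
    1: "inverted index partition",
    2: "parallel index build",
    3: "file partition",
}
doc_parts = partition_by_doc_id(docs, 2)
term_parts = partition_by_term_id(docs, 2)
print(doc_parts)
print(term_parts)
```

In the DocId layout a term such as "file" can appear in several partitions, each covering only its own documents, while in the TermId layout each term lives in exactly one partition.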

Smeaton, A.F.; Rijsbergen, C.J. van: ¬The retrieval effects of query expansion on a feedback document retrieval system (1983) 0.02

0.024572134 = product of:
  0.049144268 = sum of:
    0.049144268 = product of:
      0.098288536 = sum of:
        0.098288536 = weight(_text_:22 in 2134) [ClassicSimilarity], result of:
          0.098288536 = score(doc=2134,freq=2.0), product of:
            0.18145745 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.051817898 = queryNorm
            0.5416616 = fieldWeight in 2134, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.109375 = fieldNorm(doc=2134)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Date: 30. 3.2001 13:32:22

Back, J.: ¬An evaluation of relevancy ranking techniques used by Internet search engines (2000) 0.02

0.024572134 = product of:
  0.049144268 = sum of:
    0.049144268 = product of:
      0.098288536 = sum of:
        0.098288536 = weight(_text_:22 in 3445) [ClassicSimilarity], result of:
          0.098288536 = score(doc=3445,freq=2.0), product of:
            0.18145745 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.051817898 = queryNorm
            0.5416616 = fieldWeight in 3445, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.109375 = fieldNorm(doc=3445)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Date: 25. 8.2005 17:42:22

Chang, R.: ¬The development of indexing technology (1993) 0.02

0.023727044 = product of:
  0.04745409 = sum of:
    0.04745409 = product of:
      0.09490818 = sum of:
        0.09490818 = weight(_text_:indexing in 7024) [ClassicSimilarity], result of:
          0.09490818 = score(doc=7024,freq=4.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.47848347 = fieldWeight in 7024, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0625 = fieldNorm(doc=7024)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Abstract: Reviews the basic techniques of computerized indexing, including various file accessing methods such as: Sequential Access Method (SAM); Direct Access Method (DAM); Indexed Sequential Access Method (ISAM), and Virtual Indexed Sequential Access Method (VSAM); and various B-tree (balanced tree) structures. Illustrates how records are stored and accessed, and how B-trees are used to improve the operations of information retrieval and maintenance

Frakes, W.B.: Stemming algorithms (1992) 0.02

0.023727044 = product of:
  0.04745409 = sum of:
    0.04745409 = product of:
      0.09490818 = sum of:
        0.09490818 = weight(_text_:indexing in 3503) [ClassicSimilarity], result of:
          0.09490818 = score(doc=3503,freq=4.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.47848347 = fieldWeight in 3503, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0625 = fieldNorm(doc=3503)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Abstract: Describes stemming algorithms - programs that relate morphologically similar indexing and search terms. Stemming is used to improve retrieval effectiveness and to reduce the size of indexing files. Several approaches to stemming are described - table lookup, affix removal, successor variety, and n-gram. Empirical studies of stemming are summarized. The Porter stemmer is described in detail, and a full implementation in C is presented
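
Of the approaches listed in this abstract, the n-gram method is the easiest to illustrate briefly. The sketch below is one common formulation, offered as an assumption rather than as the chapter's own code: two words are compared by the Dice coefficient over their sets of unique character bigrams, and words with a high coefficient are treated as related. The function names are illustrative, and this is not the Porter stemmer.

```python
def bigrams(word: str) -> set[str]:
    """Return the set of unique adjacent character pairs in a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over unique bigrams: 2 * shared / (count_a + count_b)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # neither word has any bigram to compare
    return 2 * len(x & y) / (len(x) + len(y))


# "statistics" has 7 unique bigrams, "statistical" has 8, and they share 6.
print(dice("statistics", "statistical"))  # 2 * 6 / (7 + 8) = 0.8
print(dice("indexing", "retrieval"))      # no shared bigrams, so 0.0
```

A conflation step would then group terms whose pairwise coefficient exceeds a chosen threshold; the threshold itself has to be tuned per collection.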

Maron, M.E.: ¬An historical note on the origins of probabilistic indexing (2008) 0.02

0.023727044 = product of:
  0.04745409 = sum of:
    0.04745409 = product of:
      0.09490818 = sum of:
        0.09490818 = weight(_text_:indexing in 2047) [ClassicSimilarity], result of:
          0.09490818 = score(doc=2047,freq=4.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.47848347 = fieldWeight in 2047, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0625 = fieldNorm(doc=2047)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Abstract: The motivation behind "Probabilistic Indexing" was to replace two-valued thinking about information retrieval with probabilistic notions. This involved a new view of the information retrieval problem - viewing it as a problem of inference and prediction, and introducing probabilistically weighted indexes and probabilistically ranked output. These ideas were first formulated and written up in August 1958.

Thompson, P.: Looking back: on relevance, probabilistic indexing and information retrieval (2008) 0.02

0.023727044 = product of:
  0.04745409 = sum of:
    0.04745409 = product of:
      0.09490818 = sum of:
        0.09490818 = weight(_text_:indexing in 2074) [ClassicSimilarity], result of:
          0.09490818 = score(doc=2074,freq=4.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.47848347 = fieldWeight in 2074, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0625 = fieldNorm(doc=2074)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Abstract: Forty-eight years ago Maron and Kuhns published their paper, "On Relevance, Probabilistic Indexing and Information Retrieval" (1960). This was the first paper to present a probabilistic approach to information retrieval, and perhaps the first paper on ranked retrieval. Although it is one of the most widely cited papers in the field of information retrieval, many researchers today may not be familiar with its influence. This paper describes the Maron and Kuhns article and the influence that it has had on the field of information retrieval.

Efron, M.: Query expansion and dimensionality reduction : Notions of optimality in Rocchio relevance feedback and latent semantic indexing (2008) 0.02
```
0.02179468 = product of:
  0.04358936 = sum of:
    0.04358936 = product of:
      0.08717872 = sum of:
        0.08717872 = weight(_text_:indexing in 2020) [ClassicSimilarity], result of:
          0.08717872 = score(doc=2020,freq=6.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.4395151 = fieldWeight in 2020, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.046875 = fieldNorm(doc=2020)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method's basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI's and Rocchio's notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI's motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.

Object

Latent semantic indexing

Deerwester, S.; Dumais, S.; Landauer, T.; Furnas, G.; Beck, L.: Improving information retrieval with latent semantic indexing (1988) 0.02
```
0.02179468 = product of:
  0.04358936 = sum of:
    0.04358936 = product of:
      0.08717872 = sum of:
        0.08717872 = weight(_text_:indexing in 2396) [ClassicSimilarity], result of:
          0.08717872 = score(doc=2396,freq=6.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.4395151 = fieldWeight in 2396, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.046875 = fieldNorm(doc=2396)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Describes a latent semantic indexing (LSI) approach for improving information retrieval. Most document retrieval systems depend on matching keywords in queries against those in documents. The LSI approach tries to overcome the incompleteness and imprecision of latent relations among terms and documents. Tested performance of the LSI method ranged from considerably better than to roughly comparable to performance based on weighted keyword matching, apparently depending on the quality of the queries. Best LSI performance was found using a global entropy weighting for terms and about 100 dimensions for representing terms, documents and queries.

Object

Latent Semantic Indexing

Deerwester, S.C.; Dumais, S.T.; Landauer, T.K.; Furnas, G.W.; Harshman, R.A.: Indexing by latent semantic analysis (1990) 0.02
```
0.02179468 = product of:
  0.04358936 = sum of:
    0.04358936 = product of:
      0.08717872 = sum of:
        0.08717872 = weight(_text_:indexing in 2399) [ClassicSimilarity], result of:
          0.08717872 = score(doc=2399,freq=6.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.4395151 = fieldWeight in 2399, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.046875 = fieldNorm(doc=2399)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

Object

Latent Semantic Indexing

Zhang, W.; Yoshida, T.; Tang, X.: ¬A comparative study of TF*IDF, LSI and multi-words for text classification (2011) 0.02
```
0.02179468 = product of:
  0.04358936 = sum of:
    0.04358936 = product of:
      0.08717872 = sum of:
        0.08717872 = weight(_text_:indexing in 1165) [ClassicSimilarity], result of:
          0.08717872 = score(doc=1165,freq=6.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.4395151 = fieldWeight in 1165, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.046875 = fieldNorm(doc=1165)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

One of the main themes in text mining is text representation, which is fundamental and indispensable for text-based intelligent information processing. Generally, text representation includes two tasks: indexing and weighting. This paper has comparatively studied TF*IDF, LSI and multi-word for text representation. We used a Chinese and an English document collection to respectively evaluate the three methods in information retrieval and text categorization. Experimental results have demonstrated that in text categorization, LSI has better performance than other methods in both document collections. Also, LSI has produced the best performance in retrieving English documents. This outcome has shown that LSI has both favorable semantic and statistical quality, which contradicts the claim that LSI cannot produce discriminative power for indexing.

Object

Latent Semantic Indexing

Fuhr, N.: Ranking-Experimente mit gewichteter Indexierung (1986) 0.02

0.021061828 = product of:
  0.042123657 = sum of:
    0.042123657 = product of:
      0.08424731 = sum of:
        0.08424731 = weight(_text_:22 in 58) [ClassicSimilarity], result of:
          0.08424731 = score(doc=58,freq=2.0), product of:
            0.18145745 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.051817898 = queryNorm
            0.46428138 = fieldWeight in 58, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=58)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Date: 14. 6.2015 22:12:44

Fuhr, N.: Rankingexperimente mit gewichteter Indexierung (1986) 0.02

0.021061828 = product of:
  0.042123657 = sum of:
    0.042123657 = product of:
      0.08424731 = sum of:
        0.08424731 = weight(_text_:22 in 2051) [ClassicSimilarity], result of:
          0.08424731 = score(doc=2051,freq=2.0), product of:
            0.18145745 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.051817898 = queryNorm
            0.46428138 = fieldWeight in 2051, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=2051)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Date: 14. 6.2015 22:12:56

Willett, P.: Best-match text retrieval (1993) 0.02

0.020971943 = product of:
  0.041943885 = sum of:
    0.041943885 = product of:
      0.08388777 = sum of:
        0.08388777 = weight(_text_:indexing in 7818) [ClassicSimilarity], result of:
          0.08388777 = score(doc=7818,freq=2.0), product of:
            0.19835205 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.051817898 = queryNorm
            0.42292362 = fieldWeight in 7818, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.078125 = fieldNorm(doc=7818)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Abstract: Provides an introduction to the computational techniques that underlie best match searching retrieval systems. Discusses: problems of traditional Boolean systems; characteristics of best-match searching; automatic indexing; term conflation; matching of documents and queries (dealing with similarity measures, initial weights, relevance weights, and the matching algorithm); and describes operational best-match systems
