Document (#25033)

Author
Greiff, W.R.
Title
The use of exploratory data analysis in information retrieval research
Source
Advances in information retrieval: Recent research from the Center for Intelligent Information Retrieval. Ed.: W.B. Croft
Imprint
Boston, MA : Kluwer Academic Publ.
Year
2000
Pages
pp. 37-72
Series
The Kluwer international series on information retrieval; 7
Abstract
We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent otherwise. The analysis is carried out in terms of the formal notion of Weight of Evidence (WOE). As a result of this analysis, a novel theory in support of the use of inverse document frequency (idf) for document ranking is presented, and experimental evidence is given in favor of a modification of the classical idf formula motivated by the analysis. This approach is then extended to other sources of evidence commonly used for ranking in information retrieval systems.

Similar documents (content)

  1. Alzahrani, S.; Palade, V.; Salim, N.; Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications (2012) 0.14
    0.13983585 = sum of:
      0.13983585 = product of:
        0.49941373 = sum of:
          0.011438867 = weight(abstract_txt:information in 4982) [ClassicSimilarity], result of:
            0.011438867 = score(doc=4982,freq=2.0), product of:
              0.06109347 = queryWeight, product of:
                1.1232737 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.02246591 = queryNorm
              0.18723552 = fieldWeight in 4982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.15355079 = weight(abstract_txt:weight in 4982) [ClassicSimilarity], result of:
            0.15355079 = score(doc=4982,freq=4.0), product of:
              0.18989946 = queryWeight, product of:
                1.143377 = boost
                7.3928223 = idf(docFreq=73, maxDocs=44218)
                0.02246591 = queryNorm
              0.80858994 = fieldWeight in 4982, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.3928223 = idf(docFreq=73, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.086396225 = weight(abstract_txt:inverse in 4982) [ClassicSimilarity], result of:
            0.086396225 = score(doc=4982,freq=1.0), product of:
              0.2054497 = queryWeight, product of:
                1.1892695 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.02246591 = queryNorm
              0.4205225 = fieldWeight in 4982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.030059725 = weight(abstract_txt:document in 4982) [ClassicSimilarity], result of:
            0.030059725 = score(doc=4982,freq=1.0), product of:
              0.12804885 = queryWeight, product of:
                1.3277931 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.02246591 = queryNorm
              0.23475201 = fieldWeight in 4982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.03056142 = weight(abstract_txt:used in 4982) [ClassicSimilarity], result of:
            0.03056142 = score(doc=4982,freq=2.0), product of:
              0.117631 = queryWeight, product of:
                1.5586518 = boost
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.02246591 = queryNorm
              0.25980753 = fieldWeight in 4982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.033833288 = weight(abstract_txt:retrieval in 4982) [ClassicSimilarity], result of:
            0.033833288 = score(doc=4982,freq=2.0), product of:
              0.12588353 = queryWeight, product of:
                1.6123995 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.02246591 = queryNorm
              0.26876658 = fieldWeight in 4982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
          0.15357341 = weight(abstract_txt:evidence in 4982) [ClassicSimilarity], result of:
            0.15357341 = score(doc=4982,freq=3.0), product of:
              0.3014762 = queryWeight, product of:
                2.4952538 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.02246591 = queryNorm
              0.5094047 = fieldWeight in 4982, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4982)
        0.28 = coord(7/25)
    
  2. Alemayehu, N.: Analysis of performance variation using query expansion (2003) 0.13
    0.12983198 = sum of:
      0.12983198 = product of:
        0.40572494 = sum of:
          0.059746284 = weight(abstract_txt:classical in 1454) [ClassicSimilarity], result of:
            0.059746284 = score(doc=1454,freq=1.0), product of:
              0.1469789 = queryWeight, product of:
                1.0059006 = boost
                6.5039306 = idf(docFreq=179, maxDocs=44218)
                0.02246591 = queryNorm
              0.40649566 = fieldWeight in 1454, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5039306 = idf(docFreq=179, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.013072991 = weight(abstract_txt:information in 1454) [ClassicSimilarity], result of:
            0.013072991 = score(doc=1454,freq=2.0), product of:
              0.06109347 = queryWeight, product of:
                1.1232737 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.02246591 = queryNorm
              0.21398345 = fieldWeight in 1454, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.03435397 = weight(abstract_txt:document in 1454) [ClassicSimilarity], result of:
            0.03435397 = score(doc=1454,freq=1.0), product of:
              0.12804885 = queryWeight, product of:
                1.3277931 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.02246591 = queryNorm
              0.26828802 = fieldWeight in 1454, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.024194598 = weight(abstract_txt:data in 1454) [ClassicSimilarity], result of:
            0.024194598 = score(doc=1454,freq=1.0), product of:
              0.11602914 = queryWeight, product of:
                1.5480028 = boost
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.02246591 = queryNorm
              0.20852174 = fieldWeight in 1454, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.024697358 = weight(abstract_txt:used in 1454) [ClassicSimilarity], result of:
            0.024697358 = score(doc=1454,freq=1.0), product of:
              0.117631 = queryWeight, product of:
                1.5586518 = boost
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.02246591 = queryNorm
              0.2099562 = fieldWeight in 1454, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.06697255 = weight(abstract_txt:retrieval in 1454) [ClassicSimilarity], result of:
            0.06697255 = score(doc=1454,freq=6.0), product of:
              0.12588353 = queryWeight, product of:
                1.6123995 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.02246591 = queryNorm
              0.5320199 = fieldWeight in 1454, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.10779933 = weight(abstract_txt:ranking in 1454) [ClassicSimilarity], result of:
            0.10779933 = score(doc=1454,freq=2.0), product of:
              0.21783371 = queryWeight, product of:
                1.7318294 = boost
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.02246591 = queryNorm
              0.49486983 = fieldWeight in 1454, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
          0.07488789 = weight(abstract_txt:analysis in 1454) [ClassicSimilarity], result of:
            0.07488789 = score(doc=1454,freq=2.0), product of:
              0.23190072 = queryWeight, product of:
                2.8252938 = boost
                3.6535451 = idf(docFreq=3112, maxDocs=44218)
                0.02246591 = queryNorm
              0.3229308 = fieldWeight in 1454, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6535451 = idf(docFreq=3112, maxDocs=44218)
                0.0625 = fieldNorm(doc=1454)
        0.32 = coord(8/25)
    
  3. Losee, R.M.: Text windows and phrases differing by discipline, location in document, and syntactic structure (1996) 0.13
    0.12971064 = sum of:
      0.12971064 = product of:
        0.540461 = sum of:
          0.08805154 = weight(abstract_txt:studying in 6962) [ClassicSimilarity], result of:
            0.08805154 = score(doc=6962,freq=1.0), product of:
              0.1452596 = queryWeight, product of:
                6.465779 = idf(docFreq=186, maxDocs=44218)
                0.02246591 = queryNorm
              0.6061668 = fieldWeight in 6962, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.465779 = idf(docFreq=186, maxDocs=44218)
                0.09375 = fieldNorm(doc=6962)
          0.013866002 = weight(abstract_txt:information in 6962) [ClassicSimilarity], result of:
            0.013866002 = score(doc=6962,freq=1.0), product of:
              0.06109347 = queryWeight, product of:
                1.1232737 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.02246591 = queryNorm
              0.22696373 = fieldWeight in 6962, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.09375 = fieldNorm(doc=6962)
          0.051530957 = weight(abstract_txt:document in 6962) [ClassicSimilarity], result of:
            0.051530957 = score(doc=6962,freq=1.0), product of:
              0.12804885 = queryWeight, product of:
                1.3277931 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.02246591 = queryNorm
              0.40243202 = fieldWeight in 6962, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=6962)
          0.29360938 = weight(abstract_txt:regularities in 6962) [ClassicSimilarity], result of:
            0.29360938 = score(doc=6962,freq=2.0), product of:
              0.25732985 = queryWeight, product of:
                1.3309834 = boost
                8.6058445 = idf(docFreq=21, maxDocs=44218)
                0.02246591 = queryNorm
              1.1409845 = fieldWeight in 6962, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.6058445 = idf(docFreq=21, maxDocs=44218)
                0.09375 = fieldNorm(doc=6962)
          0.052391004 = weight(abstract_txt:used in 6962) [ClassicSimilarity], result of:
            0.052391004 = score(doc=6962,freq=2.0), product of:
              0.117631 = queryWeight, product of:
                1.5586518 = boost
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.02246591 = queryNorm
              0.44538432 = fieldWeight in 6962, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.09375 = fieldNorm(doc=6962)
          0.041012138 = weight(abstract_txt:retrieval in 6962) [ClassicSimilarity], result of:
            0.041012138 = score(doc=6962,freq=1.0), product of:
              0.12588353 = queryWeight, product of:
                1.6123995 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.02246591 = queryNorm
              0.3257943 = fieldWeight in 6962, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.09375 = fieldNorm(doc=6962)
        0.24 = coord(6/25)
    
  4. Daoud, M.; Huang, J.X.: Modeling geographic, temporal, and proximity contexts for improving geotemporal search (2013) 0.13
    0.1255792 = sum of:
      0.1255792 = product of:
        0.44849712 = sum of:
          0.014009693 = weight(abstract_txt:information in 533) [ClassicSimilarity], result of:
            0.014009693 = score(doc=533,freq=3.0), product of:
              0.06109347 = queryWeight, product of:
                1.1232737 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.02246591 = queryNorm
              0.22931573 = fieldWeight in 533, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
          0.084051415 = weight(abstract_txt:formula in 533) [ClassicSimilarity], result of:
            0.084051415 = score(doc=533,freq=1.0), product of:
              0.2017154 = queryWeight, product of:
                1.1784118 = boost
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.02246591 = queryNorm
              0.4166832 = fieldWeight in 533, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
          0.052064974 = weight(abstract_txt:document in 533) [ClassicSimilarity], result of:
            0.052064974 = score(doc=533,freq=3.0), product of:
              0.12804885 = queryWeight, product of:
                1.3277931 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.02246591 = queryNorm
              0.4066024 = fieldWeight in 533, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
          0.0478475 = weight(abstract_txt:retrieval in 533) [ClassicSimilarity], result of:
            0.0478475 = score(doc=533,freq=4.0), product of:
              0.12588353 = queryWeight, product of:
                1.6123995 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.02246591 = queryNorm
              0.38009337 = fieldWeight in 533, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
          0.11552335 = weight(abstract_txt:ranking in 533) [ClassicSimilarity], result of:
            0.11552335 = score(doc=533,freq=3.0), product of:
              0.21783371 = queryWeight, product of:
                1.7318294 = boost
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.02246591 = queryNorm
              0.53032815 = fieldWeight in 533, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
          0.08866565 = weight(abstract_txt:evidence in 533) [ClassicSimilarity], result of:
            0.08866565 = score(doc=533,freq=1.0), product of:
              0.3014762 = queryWeight, product of:
                2.4952538 = boost
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.02246591 = queryNorm
              0.29410496 = fieldWeight in 533, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.377919 = idf(docFreq=554, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
          0.046334516 = weight(abstract_txt:analysis in 533) [ClassicSimilarity], result of:
            0.046334516 = score(doc=533,freq=1.0), product of:
              0.23190072 = queryWeight, product of:
                2.8252938 = boost
                3.6535451 = idf(docFreq=3112, maxDocs=44218)
                0.02246591 = queryNorm
              0.19980325 = fieldWeight in 533, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6535451 = idf(docFreq=3112, maxDocs=44218)
                0.0546875 = fieldNorm(doc=533)
        0.28 = coord(7/25)
    
  5. Liu, X.; Zhang, J.; Guo, C.: Full-text citation analysis : a new method to enhance scholarly networks (2013) 0.12
    0.12277707 = sum of:
      0.12277707 = product of:
        0.43848953 = sum of:
          0.084494 = weight(abstract_txt:classical in 1044) [ClassicSimilarity], result of:
            0.084494 = score(doc=1044,freq=2.0), product of:
              0.1469789 = queryWeight, product of:
                1.0059006 = boost
                6.5039306 = idf(docFreq=179, maxDocs=44218)
                0.02246591 = queryNorm
              0.57487166 = fieldWeight in 1044, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5039306 = idf(docFreq=179, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
          0.009244001 = weight(abstract_txt:information in 1044) [ClassicSimilarity], result of:
            0.009244001 = score(doc=1044,freq=1.0), product of:
              0.06109347 = queryWeight, product of:
                1.1232737 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.02246591 = queryNorm
              0.15130915 = fieldWeight in 1044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
          0.03435397 = weight(abstract_txt:document in 1044) [ClassicSimilarity], result of:
            0.03435397 = score(doc=1044,freq=1.0), product of:
              0.12804885 = queryWeight, product of:
                1.3277931 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.02246591 = queryNorm
              0.26828802 = fieldWeight in 1044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
          0.024697358 = weight(abstract_txt:used in 1044) [ClassicSimilarity], result of:
            0.024697358 = score(doc=1044,freq=1.0), product of:
              0.117631 = queryWeight, product of:
                1.5586518 = boost
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.02246591 = queryNorm
              0.2099562 = fieldWeight in 1044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3592992 = idf(docFreq=4177, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
          0.027341427 = weight(abstract_txt:retrieval in 1044) [ClassicSimilarity], result of:
            0.027341427 = score(doc=1044,freq=1.0), product of:
              0.12588353 = queryWeight, product of:
                1.6123995 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.02246591 = queryNorm
              0.21719621 = fieldWeight in 1044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
          0.15245128 = weight(abstract_txt:ranking in 1044) [ClassicSimilarity], result of:
            0.15245128 = score(doc=1044,freq=4.0), product of:
              0.21783371 = queryWeight, product of:
                1.7318294 = boost
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.02246591 = queryNorm
              0.69985163 = fieldWeight in 1044, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.598813 = idf(docFreq=444, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
          0.10590747 = weight(abstract_txt:analysis in 1044) [ClassicSimilarity], result of:
            0.10590747 = score(doc=1044,freq=4.0), product of:
              0.23190072 = queryWeight, product of:
                2.8252938 = boost
                3.6535451 = idf(docFreq=3112, maxDocs=44218)
                0.02246591 = queryNorm
              0.45669314 = fieldWeight in 1044, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.6535451 = idf(docFreq=3112, maxDocs=44218)
                0.0625 = fieldNorm(doc=1044)
        0.28 = coord(7/25)
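
The score breakdowns above follow the pattern of Lucene's ClassicSimilarity: each matching term contributes queryWeight times fieldWeight, the contributions are summed, and the sum is scaled by coord (matched terms over query terms). The following is a minimal sketch in Python that recomputes the score of the first hit (doc 4982) from the values listed in its breakdown. The function names are hypothetical and chosen for illustration only. The idf and tf formulas, 1 + ln(maxDocs / (docFreq + 1)) and sqrt(freq), are the ClassicSimilarity defaults and agree with the figures shown. fieldNorm and queryNorm are copied from the listing rather than derived.

```python
import math

MAX_DOCS = 44218          # maxDocs, as shown in every idf(...) line
QUERY_NORM = 0.02246591   # queryNorm, copied from the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    """ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """ClassicSimilarity tf: square root of the raw term frequency."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, boost, field_norm):
    """Score of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, matched, total):
    """Sum of the term scores, scaled by coord(matched/total)."""
    total_weight = sum(
        term_score(freq, doc_freq, boost, field_norm)
        for freq, doc_freq, boost in terms
    )
    return total_weight * (matched / total)


# (termFreq, docFreq, boost) for the seven matching terms of doc 4982,
# copied from the first breakdown above.
doc_4982_terms = [
    (2.0, 10677, 1.1232737),  # information
    (4.0, 73, 1.143377),      # weight
    (1.0, 54, 1.1892695),     # inverse
    (1.0, 1642, 1.3277931),   # document
    (2.0, 4177, 1.5586518),   # used
    (2.0, 3720, 1.6123995),   # retrieval
    (3.0, 554, 2.4952538),    # evidence
]

first_hit_score = document_score(
    doc_4982_terms, field_norm=0.0546875, matched=7, total=25
)
print(round(first_hit_score, 4))
```

The result should agree with the 0.13983585 shown for the first hit up to rounding. Lucene computes in single precision, so the last digits may differ slightly.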