Document (#25305)

Author
Lam, W.
Wong, K.-F.
Wong, C.-Y.
Title
Chinese document indexing based on new partitioned signature file : model and evaluation
Source
Journal of the American Society for Information Science and Technology. 52(2001) no.7, pp. 584-597
Year
2001
Abstract
In this article we investigate the use of signature files in Chinese information retrieval systems and propose a new partitioning method for Chinese signature files based on the characteristics of Chinese words. Our partitioning method, called Partitioned Signature File for Chinese (PSFC), offers faster search than the traditional single signature file approach. We devise a general scheme for controlling the trade-off between false drops and storage overhead while maintaining the search space reduction in PSFC. An analytical study is presented to support the claims of our method. We also propose two new hashing methods for Chinese signature files so that the signature file is more suitable for dynamic environments while retrieval performance is maintained. Furthermore, we have implemented PSFC and the new hashing methods and evaluated them using a large-scale real-world Chinese document corpus, namely the TREC-5 (Text REtrieval Conference) Chinese collection. The experimental results confirm the features of PSFC and demonstrate its superiority over the traditional single signature file method.
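
For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of a generic single signature file using superimposed coding. It is not the authors' PSFC method or either of their hashing methods, which this record does not describe. The signature width, the number of bits per term, the SHA-256-based hash, and all function names are assumptions chosen for illustration only.

```python
import hashlib

SIG_BITS = 64       # signature width in bits (illustrative assumption)
BITS_PER_TERM = 3   # bit positions hashed per term (illustrative assumption)


def term_signature(term: str) -> int:
    """Map a term to a bit pattern with at most BITS_PER_TERM bits set."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        pos = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def document_signature(terms) -> int:
    """Superimpose (bitwise OR) the signatures of all terms in a document."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig


def candidates(signatures, query_terms):
    """Return indexes of documents whose signature covers the query signature.

    The result may contain false drops: documents that pass the bit test
    although they do not contain every query term.
    """
    q = document_signature(query_terms)
    return [i for i, s in enumerate(signatures) if s & q == q]


def search(signatures, docs, query_terms):
    """Filter by signature, then verify against the text to discard false drops."""
    return [
        i for i in candidates(signatures, query_terms)
        if all(t in docs[i] for t in query_terms)
    ]


# Hypothetical two-document collection, already segmented into words.
docs = [["信息", "检索"], ["签名", "文件"]]
signatures = [document_signature(d) for d in docs]
hits = search(signatures, docs, ["签名"])
```

In this sketch every query scans all signatures. A partitioned scheme such as the one the abstract proposes aims to restrict that scan to a subset of the file.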

Similar documents (author)

  1. Wong, S.K.M.: On modelling information retrieval with probabilistic inference (1995) 5.12
    5.123222 = sum of:
      5.123222 = weight(author_txt:wong in 2007) [ClassicSimilarity], result of:
        5.123222 = fieldWeight in 2007, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.197155 = idf(docFreq=31, maxDocs=42740)
          0.625 = fieldNorm(doc=2007)
    
  2. Wong, K.: Frühe Spuren des menschlichen Geistes (2005) 5.12
    5.123222 = sum of:
      5.123222 = weight(author_txt:wong in 1984) [ClassicSimilarity], result of:
        5.123222 = fieldWeight in 1984, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.197155 = idf(docFreq=31, maxDocs=42740)
          0.625 = fieldNorm(doc=1984)
    
  3. Salton, G.; Wong, A.: Generation and search of clustered files (1978) 4.10
    4.0985775 = sum of:
      4.0985775 = weight(author_txt:wong in 2411) [ClassicSimilarity], result of:
        4.0985775 = fieldWeight in 2411, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.197155 = idf(docFreq=31, maxDocs=42740)
          0.5 = fieldNorm(doc=2411)
    
  4. Wong, S.K.M.; Yao, Y.Y.: An information-theoretic measure of term specifics (1992) 4.10
    4.0985775 = sum of:
      4.0985775 = weight(author_txt:wong in 4807) [ClassicSimilarity], result of:
        4.0985775 = fieldWeight in 4807, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.197155 = idf(docFreq=31, maxDocs=42740)
          0.5 = fieldNorm(doc=4807)
    
  5. Wong, W.Y.P.; Lee, D.L.: Implementation of partial document ranking using inverted files (1993) 4.10
    4.0985775 = sum of:
      4.0985775 = weight(author_txt:wong in 6539) [ClassicSimilarity], result of:
        4.0985775 = fieldWeight in 6539, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.197155 = idf(docFreq=31, maxDocs=42740)
          0.5 = fieldNorm(doc=6539)
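
The author-match scores above can be reproduced by hand. The sketch below assumes the tf and idf formulas that fit the numbers shown in this listing (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))). Other Lucene versions may define idf slightly differently, and the function names are illustrative only.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """idf as it appears to be computed in the explain output above."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """tf is the square root of the raw term frequency."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain trees."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Entry 1 (doc 2007): fieldNorm 0.625; entry 3 (doc 2411): fieldNorm 0.5.
score_doc_2007 = field_weight(1.0, 31, 42740, 0.625)
score_doc_2411 = field_weight(1.0, 31, 42740, 0.5)
```

The two groups of scores (5.12 and 4.10) differ only in fieldNorm, which in this listing is smaller for records whose author field holds more names.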
    

Similar documents (content)

  1. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.63
    0.63393265 = sum of:
      0.63393265 = product of:
        1.9810395 = sum of:
          0.01160298 = weight(abstract_txt:search in 3418) [ClassicSimilarity], result of:
            0.01160298 = score(doc=3418,freq=1.0), product of:
              0.03389399 = queryWeight, product of:
                1.0131133 = boost
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.009161977 = queryNorm
              0.34233147 = fieldWeight in 3418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          0.07615133 = weight(abstract_txt:false in 3418) [ClassicSimilarity], result of:
            0.07615133 = score(doc=3418,freq=2.0), product of:
              0.07484706 = queryWeight, product of:
                1.0645572 = boost
                7.6739063 = idf(docFreq=53, maxDocs=42740)
                0.009161977 = queryNorm
              1.0174258 = fieldWeight in 3418, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.6739063 = idf(docFreq=53, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          0.03738845 = weight(abstract_txt:document in 3418) [ClassicSimilarity], result of:
            0.03738845 = score(doc=3418,freq=4.0), product of:
              0.04658163 = queryWeight, product of:
                1.1876924 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.009161977 = queryNorm
              0.80264366 = fieldWeight in 3418, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          0.01486857 = weight(abstract_txt:retrieval in 3418) [ClassicSimilarity], result of:
            0.01486857 = score(doc=3418,freq=1.0), product of:
              0.045774076 = queryWeight, product of:
                1.4419562 = boost
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.009161977 = queryNorm
              0.3248251 = fieldWeight in 3418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          0.044357877 = weight(abstract_txt:files in 3418) [ClassicSimilarity], result of:
            0.044357877 = score(doc=3418,freq=1.0), product of:
              0.08286864 = queryWeight, product of:
                1.5841334 = boost
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.009161977 = queryNorm
              0.5352794 = fieldWeight in 3418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          0.24764591 = weight(abstract_txt:partitioned in 3418) [ClassicSimilarity], result of:
            0.24764591 = score(doc=3418,freq=2.0), product of:
              0.20699213 = queryWeight, product of:
                2.5036497 = boost
                9.023833 = idf(docFreq=13, maxDocs=42740)
                0.009161977 = queryNorm
              1.1964025 = fieldWeight in 3418, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.023833 = idf(docFreq=13, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          0.24700223 = weight(abstract_txt:file in 3418) [ClassicSimilarity], result of:
            0.24700223 = score(doc=3418,freq=4.0), product of:
              0.23653607 = queryWeight, product of:
                4.6356 = boost
                5.5693207 = idf(docFreq=442, maxDocs=42740)
                0.009161977 = queryNorm
              1.0442476 = fieldWeight in 3418, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.5693207 = idf(docFreq=442, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
          1.3020222 = weight(abstract_txt:signature in 3418) [ClassicSimilarity], result of:
            1.3020222 = score(doc=3418,freq=5.0), product of:
              0.73201275 = queryWeight, product of:
                9.416423 = boost
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.009161977 = queryNorm
              1.778688 = fieldWeight in 3418, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.09375 = fieldNorm(doc=3418)
        0.32 = coord(8/25)
    
  2. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.38
    0.38048333 = sum of:
      0.38048333 = product of:
        1.5853473 = sum of:
          0.013674242 = weight(abstract_txt:search in 43) [ClassicSimilarity], result of:
            0.013674242 = score(doc=43,freq=2.0), product of:
              0.03389399 = queryWeight, product of:
                1.0131133 = boost
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.009161977 = queryNorm
              0.4034415 = fieldWeight in 43, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6515355 = idf(docFreq=3014, maxDocs=42740)
                0.078125 = fieldNorm(doc=43)
          0.014424718 = weight(abstract_txt:methods in 43) [ClassicSimilarity], result of:
            0.014424718 = score(doc=43,freq=1.0), product of:
              0.044252258 = queryWeight, product of:
                1.1576155 = boost
                4.172361 = idf(docFreq=1790, maxDocs=42740)
                0.009161977 = queryNorm
              0.3259657 = fieldWeight in 43, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.172361 = idf(docFreq=1790, maxDocs=42740)
                0.078125 = fieldNorm(doc=43)
          0.0739298 = weight(abstract_txt:files in 43) [ClassicSimilarity], result of:
            0.0739298 = score(doc=43,freq=4.0), product of:
              0.08286864 = queryWeight, product of:
                1.5841334 = boost
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.009161977 = queryNorm
              0.8921324 = fieldWeight in 43, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.078125 = fieldNorm(doc=43)
          0.25275257 = weight(abstract_txt:partitioning in 43) [ClassicSimilarity], result of:
            0.25275257 = score(doc=43,freq=3.0), product of:
              0.20699213 = queryWeight, product of:
                2.5036497 = boost
                9.023833 = idf(docFreq=13, maxDocs=42740)
                0.009161977 = queryNorm
              1.2210733 = fieldWeight in 43, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.023833 = idf(docFreq=13, maxDocs=42740)
                0.078125 = fieldNorm(doc=43)
          0.14554745 = weight(abstract_txt:file in 43) [ClassicSimilarity], result of:
            0.14554745 = score(doc=43,freq=2.0), product of:
              0.23653607 = queryWeight, product of:
                4.6356 = boost
                5.5693207 = idf(docFreq=442, maxDocs=42740)
                0.009161977 = queryNorm
              0.6153288 = fieldWeight in 43, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5693207 = idf(docFreq=442, maxDocs=42740)
                0.078125 = fieldNorm(doc=43)
          1.0850185 = weight(abstract_txt:signature in 43) [ClassicSimilarity], result of:
            1.0850185 = score(doc=43,freq=5.0), product of:
              0.73201275 = queryWeight, product of:
                9.416423 = boost
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.009161977 = queryNorm
              1.48224 = fieldWeight in 43, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.078125 = fieldNorm(doc=43)
        0.24 = coord(6/25)
    
  3. Carterette, B.; Can, F.: Comparing inverted files and signature files for searching a large lexicon (2005) 0.24
    0.24165855 = sum of:
      0.24165855 = product of:
        1.510366 = sum of:
          0.06402508 = weight(abstract_txt:files in 3030) [ClassicSimilarity], result of:
            0.06402508 = score(doc=3030,freq=3.0), product of:
              0.08286864 = queryWeight, product of:
                1.5841334 = boost
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.009161977 = queryNorm
              0.77260923 = fieldWeight in 3030, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.078125 = fieldNorm(doc=3030)
          0.051927365 = weight(abstract_txt:method in 3030) [ClassicSimilarity], result of:
            0.051927365 = score(doc=3030,freq=2.0), product of:
              0.10394288 = queryWeight, product of:
                2.5090482 = boost
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.009161977 = queryNorm
              0.49957597 = fieldWeight in 3030, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.078125 = fieldNorm(doc=3030)
          0.2058352 = weight(abstract_txt:file in 3030) [ClassicSimilarity], result of:
            0.2058352 = score(doc=3030,freq=4.0), product of:
              0.23653607 = queryWeight, product of:
                4.6356 = boost
                5.5693207 = idf(docFreq=442, maxDocs=42740)
                0.009161977 = queryNorm
              0.87020636 = fieldWeight in 3030, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.5693207 = idf(docFreq=442, maxDocs=42740)
                0.078125 = fieldNorm(doc=3030)
          1.1885784 = weight(abstract_txt:signature in 3030) [ClassicSimilarity], result of:
            1.1885784 = score(doc=3030,freq=6.0), product of:
              0.73201275 = queryWeight, product of:
                9.416423 = boost
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.009161977 = queryNorm
              1.6237127 = fieldWeight in 3030, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.078125 = fieldNorm(doc=3030)
        0.16 = coord(4/25)
    
  4. Faloutsos, C.: Signature files (1992) 0.17
    0.17018017 = sum of:
      0.17018017 = product of:
        1.4181681 = sum of:
          0.03497808 = weight(abstract_txt:methods in 4500) [ClassicSimilarity], result of:
            0.03497808 = score(doc=4500,freq=3.0), product of:
              0.044252258 = queryWeight, product of:
                1.1576155 = boost
                4.172361 = idf(docFreq=1790, maxDocs=42740)
                0.009161977 = queryNorm
              0.79042476 = fieldWeight in 4500, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.172361 = idf(docFreq=1790, maxDocs=42740)
                0.109375 = fieldNorm(doc=4500)
          0.024531888 = weight(abstract_txt:retrieval in 4500) [ClassicSimilarity], result of:
            0.024531888 = score(doc=4500,freq=2.0), product of:
              0.045774076 = queryWeight, product of:
                1.4419562 = boost
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.009161977 = queryNorm
              0.5359341 = fieldWeight in 4500, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.109375 = fieldNorm(doc=4500)
          1.3586581 = weight(abstract_txt:signature in 4500) [ClassicSimilarity], result of:
            1.3586581 = score(doc=4500,freq=4.0), product of:
              0.73201275 = queryWeight, product of:
                9.416423 = boost
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.009161977 = queryNorm
              1.856058 = fieldWeight in 4500, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.109375 = fieldNorm(doc=4500)
        0.12 = coord(3/25)
    
  5. Almerri, J.; McGregor, D.R.: Codon signatures : a document retrieval method (1996) 0.16
    0.16397798 = sum of:
      0.16397798 = product of:
        0.8198899 = sum of:
          0.01557852 = weight(abstract_txt:document in 40) [ClassicSimilarity], result of:
            0.01557852 = score(doc=40,freq=1.0), product of:
              0.04658163 = queryWeight, product of:
                1.1876924 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.009161977 = queryNorm
              0.33443484 = fieldWeight in 40, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.078125 = fieldNorm(doc=40)
          0.017522778 = weight(abstract_txt:retrieval in 40) [ClassicSimilarity], result of:
            0.017522778 = score(doc=40,freq=2.0), product of:
              0.045774076 = queryWeight, product of:
                1.4419562 = boost
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.009161977 = queryNorm
              0.3828101 = fieldWeight in 40, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4648013 = idf(docFreq=3633, maxDocs=42740)
                0.078125 = fieldNorm(doc=40)
          0.0369649 = weight(abstract_txt:files in 40) [ClassicSimilarity], result of:
            0.0369649 = score(doc=40,freq=1.0), product of:
              0.08286864 = queryWeight, product of:
                1.5841334 = boost
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.009161977 = queryNorm
              0.4460662 = fieldWeight in 40, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.709647 = idf(docFreq=384, maxDocs=42740)
                0.078125 = fieldNorm(doc=40)
          0.063597776 = weight(abstract_txt:method in 40) [ClassicSimilarity], result of:
            0.063597776 = score(doc=40,freq=3.0), product of:
              0.10394288 = queryWeight, product of:
                2.5090482 = boost
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.009161977 = queryNorm
              0.6118531 = fieldWeight in 40, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.5216455 = idf(docFreq=1262, maxDocs=42740)
                0.078125 = fieldNorm(doc=40)
          0.68622595 = weight(abstract_txt:signature in 40) [ClassicSimilarity], result of:
            0.68622595 = score(doc=40,freq=2.0), product of:
              0.73201275 = queryWeight, product of:
                9.416423 = boost
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.009161977 = queryNorm
              0.9374508 = fieldWeight in 40, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.484837 = idf(docFreq=23, maxDocs=42740)
                0.078125 = fieldNorm(doc=40)
        0.2 = coord(5/25)
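
The content-match scores combine a per-term queryWeight and fieldWeight, sum them, and scale the sum by the coord factor. As a hedged sketch, the snippet below recomputes the score of entry 4 (doc 4500) from the boost, idf, freq, fieldNorm and queryNorm values printed in its explain tree. The variable and function names are illustrative.

```python
import math

QUERY_NORM = 0.009161977   # queryNorm copied from the explain output
QUERY_TERM_COUNT = 25      # denominator of coord(3/25)

# (term, boost, idf, freq, fieldNorm) copied from the explain tree of doc 4500
matches = [
    ("methods",   1.1576155, 4.172361,  3.0, 0.109375),
    ("retrieval", 1.4419562, 3.4648013, 2.0, 0.109375),
    ("signature", 9.416423,  8.484837,  4.0, 0.109375),
]


def term_score(boost: float, idf: float, freq: float, field_norm: float) -> float:
    """score = queryWeight * fieldWeight for one matching term."""
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


raw_sum = sum(term_score(b, i, f, n) for _, b, i, f, n in matches)
coord = len(matches) / QUERY_TERM_COUNT   # fraction of query terms matched
total = raw_sum * coord
```

Because coord rewards documents that match more of the 25 query terms, entry 1 (8 of 25 terms) ranks well above entry 4 even though the term "signature" alone contributes slightly more to entry 4's sum.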