Document (#32265)

Author
Zhan, J.
Loh, H.T.
Title
Using latent semantic indexing to improve the accuracy of document clustering
Source
Journal of information and knowledge management. 6(2007) no.3, pp.181-188
Year
2007
Abstract
Document clustering is a significant research issue in information retrieval and text mining. Traditionally, most clustering methods were based on the vector space model, which has a few limitations such as high dimensionality and weakness in handling synonymous and polysemous problems. Latent semantic indexing (LSI) is able to deal with such problems to some extent. Previous studies have shown that using LSI could reduce the time needed to cluster a large document set while having little effect on clustering accuracy. However, when clustering a small document set, accuracy is of greater concern than efficiency. In this paper, we demonstrate that LSI can improve the clustering accuracy of a small document set, and we also recommend the dimensions needed to achieve the best clustering performance.
Object
Latent Semantic Indexing

Similar documents (content)

  1. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.33
    0.328929 = sum of:
      0.328929 = product of:
        1.0279032 = sum of:
          0.04057084 = weight(abstract_txt:efficiency in 690) [ClassicSimilarity], result of:
            0.04057084 = score(doc=690,freq=1.0), product of:
              0.085308254 = queryWeight, product of:
                1.0012004 = boost
                6.087415 = idf(docFreq=272, maxDocs=44218)
                0.013997069 = queryNorm
              0.47557932 = fieldWeight in 690, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.087415 = idf(docFreq=272, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.0498661 = weight(abstract_txt:vector in 690) [ClassicSimilarity], result of:
            0.0498661 = score(doc=690,freq=1.0), product of:
              0.097885564 = queryWeight, product of:
                1.0724692 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.013997069 = queryNorm
              0.5094326 = fieldWeight in 690, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.014459495 = weight(abstract_txt:such in 690) [ClassicSimilarity], result of:
            0.014459495 = score(doc=690,freq=1.0), product of:
              0.054029025 = queryWeight, product of:
                1.1268188 = boost
                3.4255946 = idf(docFreq=3909, maxDocs=44218)
                0.013997069 = queryNorm
              0.2676246 = fieldWeight in 690, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4255946 = idf(docFreq=3909, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.0418605 = weight(abstract_txt:indexing in 690) [ClassicSimilarity], result of:
            0.0418605 = score(doc=690,freq=2.0), product of:
              0.08710665 = queryWeight, product of:
                1.430758 = boost
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.013997069 = queryNorm
              0.48056605 = fieldWeight in 690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.055806827 = weight(abstract_txt:semantic in 690) [ClassicSimilarity], result of:
            0.055806827 = score(doc=690,freq=3.0), product of:
              0.09217423 = queryWeight, product of:
                1.471788 = boost
                4.4743214 = idf(docFreq=1369, maxDocs=44218)
                0.013997069 = queryNorm
              0.6054493 = fieldWeight in 690, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4743214 = idf(docFreq=1369, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.21336888 = weight(abstract_txt:latent in 690) [ClassicSimilarity], result of:
            0.21336888 = score(doc=690,freq=3.0), product of:
              0.22537479 = queryWeight, product of:
                2.3014057 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.013997069 = queryNorm
              0.9467291 = fieldWeight in 690, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.123199694 = weight(abstract_txt:document in 690) [ClassicSimilarity], result of:
            0.123199694 = score(doc=690,freq=3.0), product of:
              0.21209855 = queryWeight, product of:
                3.5300379 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.013997069 = queryNorm
              0.5808606 = fieldWeight in 690, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
          0.48877084 = weight(abstract_txt:clustering in 690) [ClassicSimilarity], result of:
            0.48877084 = score(doc=690,freq=2.0), product of:
              0.71165895 = queryWeight, product of:
                8.179118 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.013997069 = queryNorm
              0.6868049 = fieldWeight in 690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.078125 = fieldNorm(doc=690)
        0.32 = coord(8/25)
    
  2. Cai, X.; Li, W.: Enhancing sentence-level clustering with integrated and interactive frameworks for theme-based summarization (2011) 0.26
    0.26104206 = sum of:
      0.26104206 = product of:
        1.3052102 = sum of:
          0.03989288 = weight(abstract_txt:vector in 4770) [ClassicSimilarity], result of:
            0.03989288 = score(doc=4770,freq=1.0), product of:
              0.097885564 = queryWeight, product of:
                1.0724692 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.013997069 = queryNorm
              0.4075461 = fieldWeight in 4770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          0.043137617 = weight(abstract_txt:traditionally in 4770) [ClassicSimilarity], result of:
            0.043137617 = score(doc=4770,freq=1.0), product of:
              0.10312386 = queryWeight, product of:
                1.1007916 = boost
                6.6929407 = idf(docFreq=148, maxDocs=44218)
                0.013997069 = queryNorm
              0.4183088 = fieldWeight in 4770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6929407 = idf(docFreq=148, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          0.011951909 = weight(abstract_txt:using in 4770) [ClassicSimilarity], result of:
            0.011951909 = score(doc=4770,freq=1.0), product of:
              0.055219177 = queryWeight, product of:
                1.139162 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.013997069 = queryNorm
              0.21644491 = fieldWeight in 4770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          0.13938454 = weight(abstract_txt:document in 4770) [ClassicSimilarity], result of:
            0.13938454 = score(doc=4770,freq=6.0), product of:
              0.21209855 = queryWeight, product of:
                3.5300379 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.013997069 = queryNorm
              0.65716875 = fieldWeight in 4770, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          1.0708433 = weight(abstract_txt:clustering in 4770) [ClassicSimilarity], result of:
            1.0708433 = score(doc=4770,freq=15.0), product of:
              0.71165895 = queryWeight, product of:
                8.179118 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.013997069 = queryNorm
              1.5047143 = fieldWeight in 4770, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
        0.2 = coord(5/25)
    
  3. Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.21
    0.21459377 = sum of:
      0.21459377 = product of:
        1.0729688 = sum of:
          0.057375833 = weight(abstract_txt:efficiency in 3464) [ClassicSimilarity], result of:
            0.057375833 = score(doc=3464,freq=2.0), product of:
              0.085308254 = queryWeight, product of:
                1.0012004 = boost
                6.087415 = idf(docFreq=272, maxDocs=44218)
                0.013997069 = queryNorm
              0.6725707 = fieldWeight in 3464, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.087415 = idf(docFreq=272, maxDocs=44218)
                0.078125 = fieldNorm(doc=3464)
          0.042356115 = weight(abstract_txt:mining in 3464) [ClassicSimilarity], result of:
            0.042356115 = score(doc=3464,freq=1.0), product of:
              0.08779284 = queryWeight, product of:
                1.0156757 = boost
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.013997069 = queryNorm
              0.4824552 = fieldWeight in 3464, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.078125 = fieldNorm(doc=3464)
          0.014939888 = weight(abstract_txt:using in 3464) [ClassicSimilarity], result of:
            0.014939888 = score(doc=3464,freq=1.0), product of:
              0.055219177 = queryWeight, product of:
                1.139162 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.013997069 = queryNorm
              0.27055615 = fieldWeight in 3464, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.078125 = fieldNorm(doc=3464)
          0.043890383 = weight(abstract_txt:improve in 3464) [ClassicSimilarity], result of:
            0.043890383 = score(doc=3464,freq=1.0), product of:
              0.113267325 = queryWeight, product of:
                1.6315216 = boost
                4.9599204 = idf(docFreq=842, maxDocs=44218)
                0.013997069 = queryNorm
              0.3874938 = fieldWeight in 3464, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9599204 = idf(docFreq=842, maxDocs=44218)
                0.078125 = fieldNorm(doc=3464)
          0.9144066 = weight(abstract_txt:clustering in 3464) [ClassicSimilarity], result of:
            0.9144066 = score(doc=3464,freq=7.0), product of:
              0.71165895 = queryWeight, product of:
                8.179118 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.013997069 = queryNorm
              1.2848943 = fieldWeight in 3464, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.078125 = fieldNorm(doc=3464)
        0.2 = coord(5/25)
    
  4. Shah, B.; Raghavan, V.; Dhatric, P.; Zhao, X.: A cluster-based approach for efficient content-based image retrieval using a similarity-preserving space transformation method (2006) 0.20
    0.20082805 = sum of:
      0.20082805 = product of:
        0.717243 = sum of:
          0.04016308 = weight(abstract_txt:efficiency in 6118) [ClassicSimilarity], result of:
            0.04016308 = score(doc=6118,freq=2.0), product of:
              0.085308254 = queryWeight, product of:
                1.0012004 = boost
                6.087415 = idf(docFreq=272, maxDocs=44218)
                0.013997069 = queryNorm
              0.47079948 = fieldWeight in 6118, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.087415 = idf(docFreq=272, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
          0.03490627 = weight(abstract_txt:vector in 6118) [ClassicSimilarity], result of:
            0.03490627 = score(doc=6118,freq=1.0), product of:
              0.097885564 = queryWeight, product of:
                1.0724692 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.013997069 = queryNorm
              0.35660285 = fieldWeight in 6118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
          0.010457921 = weight(abstract_txt:using in 6118) [ClassicSimilarity], result of:
            0.010457921 = score(doc=6118,freq=1.0), product of:
              0.055219177 = queryWeight, product of:
                1.139162 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.013997069 = queryNorm
              0.18938929 = fieldWeight in 6118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
          0.019984242 = weight(abstract_txt:problems in 6118) [ClassicSimilarity], result of:
            0.019984242 = score(doc=6118,freq=1.0), product of:
              0.08503247 = queryWeight, product of:
                1.4136207 = boost
                4.297489 = idf(docFreq=1634, maxDocs=44218)
                0.013997069 = queryNorm
              0.23501894 = fieldWeight in 6118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.297489 = idf(docFreq=1634, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
          0.020719891 = weight(abstract_txt:indexing in 6118) [ClassicSimilarity], result of:
            0.020719891 = score(doc=6118,freq=1.0), product of:
              0.08710665 = queryWeight, product of:
                1.430758 = boost
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.013997069 = queryNorm
              0.23786807 = fieldWeight in 6118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3495874 = idf(docFreq=1551, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
          0.107153125 = weight(abstract_txt:accuracy in 6118) [ClassicSimilarity], result of:
            0.107153125 = score(doc=6118,freq=1.0), product of:
              0.32820076 = queryWeight, product of:
                3.9275823 = boost
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.013997069 = queryNorm
              0.32648653 = fieldWeight in 6118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
          0.4838585 = weight(abstract_txt:clustering in 6118) [ClassicSimilarity], result of:
            0.4838585 = score(doc=6118,freq=4.0), product of:
              0.71165895 = queryWeight, product of:
                8.179118 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.013997069 = queryNorm
              0.6799022 = fieldWeight in 6118, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0546875 = fieldNorm(doc=6118)
        0.28 = coord(7/25)
    
  5. Cribbin, T.: Discovering latent topical structure by second-order similarity analysis (2011) 0.19
    0.19470634 = sum of:
      0.19470634 = product of:
        0.6084573 = sum of:
          0.035801247 = weight(abstract_txt:reduce in 4470) [ClassicSimilarity], result of:
            0.035801247 = score(doc=4470,freq=1.0), product of:
              0.09107248 = queryWeight, product of:
                1.0344728 = boost
                6.2897153 = idf(docFreq=222, maxDocs=44218)
                0.013997069 = queryNorm
              0.3931072 = fieldWeight in 4470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2897153 = idf(docFreq=222, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.06909649 = weight(abstract_txt:vector in 4470) [ClassicSimilarity], result of:
            0.06909649 = score(doc=4470,freq=3.0), product of:
              0.097885564 = queryWeight, product of:
                1.0724692 = boost
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.013997069 = queryNorm
              0.70589054 = fieldWeight in 4470, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.5207376 = idf(docFreq=176, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.07684379 = weight(abstract_txt:synonymous in 4470) [ClassicSimilarity], result of:
            0.07684379 = score(doc=4470,freq=1.0), product of:
              0.1515401 = queryWeight, product of:
                1.3344101 = boost
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.013997069 = queryNorm
              0.5070855 = fieldWeight in 4470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.113368 = idf(docFreq=35, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.022839133 = weight(abstract_txt:problems in 4470) [ClassicSimilarity], result of:
            0.022839133 = score(doc=4470,freq=1.0), product of:
              0.08503247 = queryWeight, product of:
                1.4136207 = boost
                4.297489 = idf(docFreq=1634, maxDocs=44218)
                0.013997069 = queryNorm
              0.26859307 = fieldWeight in 4470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.297489 = idf(docFreq=1634, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.036452867 = weight(abstract_txt:semantic in 4470) [ClassicSimilarity], result of:
            0.036452867 = score(doc=4470,freq=2.0), product of:
              0.09217423 = queryWeight, product of:
                1.471788 = boost
                4.4743214 = idf(docFreq=1369, maxDocs=44218)
                0.013997069 = queryNorm
              0.39547786 = fieldWeight in 4470, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4743214 = idf(docFreq=1369, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.1398252 = weight(abstract_txt:polysemous in 4470) [ClassicSimilarity], result of:
            0.1398252 = score(doc=4470,freq=1.0), product of:
              0.22586313 = queryWeight, product of:
                1.6291016 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.013997069 = queryNorm
              0.6190705 = fieldWeight in 4470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.1706951 = weight(abstract_txt:latent in 4470) [ClassicSimilarity], result of:
            0.1706951 = score(doc=4470,freq=3.0), product of:
              0.22537479 = queryWeight, product of:
                2.3014057 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.013997069 = queryNorm
              0.7573833 = fieldWeight in 4470, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
          0.0569035 = weight(abstract_txt:document in 4470) [ClassicSimilarity], result of:
            0.0569035 = score(doc=4470,freq=1.0), product of:
              0.21209855 = queryWeight, product of:
                3.5300379 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.013997069 = queryNorm
              0.26828802 = fieldWeight in 4470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=4470)
        0.32 = coord(8/25)
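
The score trees above follow the usual Lucene ClassicSimilarity (TF-IDF) pattern: each term score is queryWeight times fieldWeight, the term scores are summed, and the sum is scaled by the coord factor. As a minimal sketch, the Python below recomputes the first result (doc 690) from the numbers shown in its listing. The function and variable names (`idf`, `term_score`, `TERMS_690`, and so on) are illustrative assumptions and not part of any Lucene API; the formulas idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq) are the ClassicSimilarity defaults, which agree with the idf and tf values printed above.

```python
import math

# Constants copied from the explain output above.
MAX_DOCS = 44218
QUERY_NORM = 0.013997069


def idf(doc_freq, max_docs=MAX_DOCS):
    """ClassicSimilarity inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm, query_norm=QUERY_NORM):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm      # query side
    field_weight = math.sqrt(freq) * term_idf * field_norm  # document side
    return query_weight * field_weight


# Matching terms for doc 690: term -> (freq, docFreq, boost), as listed above.
TERMS_690 = {
    "efficiency": (1.0, 272, 1.0012004),
    "vector": (1.0, 176, 1.0724692),
    "such": (1.0, 3909, 1.1268188),
    "indexing": (2.0, 1551, 1.430758),
    "semantic": (3.0, 1369, 1.471788),
    "latent": (3.0, 109, 2.3014057),
    "document": (3.0, 1642, 3.5300379),
    "clustering": (2.0, 239, 8.179118),
}
FIELD_NORM_690 = 0.078125
QUERY_TERM_COUNT = 25  # denominator of coord(8/25)

# Sum the per-term scores, then scale by the fraction of query terms matched.
raw_sum_690 = sum(
    term_score(freq, doc_freq, boost, FIELD_NORM_690)
    for freq, doc_freq, boost in TERMS_690.values()
)
coord_690 = len(TERMS_690) / QUERY_TERM_COUNT
total_690 = raw_sum_690 * coord_690

print(f"sum of term scores: {raw_sum_690:.6f}")
print(f"coord:              {coord_690:.2f}")
print(f"final score:        {total_690:.6f}")
```

Because Lucene computes in 32-bit floats and this sketch uses Python's 64-bit floats, the result should agree with the listed 0.328929 only to within rounding. The same two functions can be applied to the other results by substituting their freq, docFreq, boost and fieldNorm values.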