Document (#33358)

Author
Rooney, N.
Patterson, D.
Galushka, M.
Dobrynin, V.
Smirnova, E.
Title
An investigation into the stability of contextual document clustering
Source
Journal of the American Society for Information Science and Technology. 59(2008) no.2, pp.256-266
Year
2008
Abstract
In this article, we assess the effectiveness of Contextual Document Clustering (CDC) as a means of indexing within a dynamic and rapidly changing environment. We simulate a dynamic environment by splitting two chronologically ordered datasets into time-ordered segments and assessing how the technique performs under two scenarios. In the first, new documents are added incrementally without reclustering [incremental CDC (iCDC)]; in the second, reclustering is performed [nonincremental CDC (nCDC)]. The datasets are very large, independent of each other, and drawn from two very different domains. We show that CDC itself is effective at clustering very large document corpora and that, significantly, it lends itself to a very simple, efficient incremental document addition process that remains very stable over time even as the corpus grows considerably. The process was effective at incrementally clustering new documents even when the corpus grew to six times its original size. This is in contrast to what other researchers have found when applying similarly simple incremental approaches to document clustering. The stability of iCDC is accounted for by the unique manner in which CDC discovers cluster themes.
Theme
Automatic classification

Similar documents (author)

  1. Patterson, E.L.: The bibliographic control of microforms (1992) 6.17
    6.169457 = sum of:
      6.169457 = weight(author_txt:patterson in 1354) [ClassicSimilarity], result of:
        6.169457 = fieldWeight in 1354, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.871131 = idf(docFreq=5, maxDocs=42740)
          0.625 = fieldNorm(doc=1354)
    
  2. Patterson, C.D.: Origins of systematic serials control : remembering Carolyn Ulrich (1988) 6.17
    6.169457 = sum of:
      6.169457 = weight(author_txt:patterson in 2544) [ClassicSimilarity], result of:
        6.169457 = fieldWeight in 2544, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.871131 = idf(docFreq=5, maxDocs=42740)
          0.625 = fieldNorm(doc=2544)
    
  3. Rohlfing, H.; Schappacher, N.; Patterson, S.J.: Das Zentralarchiv für Mathematiker-Nachlässe an der Niedersächsischen Staats- und Universitätsbibliothek (2003) 3.70
    3.701674 = sum of:
      3.701674 = weight(author_txt:patterson in 3322) [ClassicSimilarity], result of:
        3.701674 = fieldWeight in 3322, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.871131 = idf(docFreq=5, maxDocs=42740)
          0.375 = fieldNorm(doc=3322)
    
  4. Barry, E.; Bedoya, J.K.; Groom, C.; Patterson, L.: Virtual reference in UK academic libraries : the virtual enquiry project 2008-2009 (2010) 3.08
    3.0847285 = sum of:
      3.0847285 = weight(author_txt:patterson in 4967) [ClassicSimilarity], result of:
        3.0847285 = fieldWeight in 4967, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.871131 = idf(docFreq=5, maxDocs=42740)
          0.3125 = fieldNorm(doc=4967)
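
The author-match weights above are labeled [ClassicSimilarity], which suggests Lucene's classic TF-IDF scoring. As a minimal sketch, assuming idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq), the following Python reproduces the fieldWeight values listed for the author hits. The function names are illustrative and are not part of Lucene's API.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming the classic TF-IDF form."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term frequency factor, assumed to be the square root of the raw count."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain output above."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the first author hit (doc 1354).
idf_patterson = classic_idf(doc_freq=5, max_docs=42740)
weight_doc_1354 = field_weight(freq=1.0, doc_freq=5, max_docs=42740, field_norm=0.625)

print(round(idf_patterson, 4), round(weight_doc_1354, 4))
```

Because the term frequency and idf are identical for all four author hits, the ranking depends only on fieldNorm, which is smaller for records with more tokens in the author field.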
    

Similar documents (content)

  1. Can, F.: Incremental clustering for dynamic information processing (1993) 0.32
    0.31610304 = sum of:
      0.31610304 = product of:
        0.987822 = sum of:
          0.033753928 = weight(abstract_txt:large in 6627) [ClassicSimilarity], result of:
            0.033753928 = score(doc=6627,freq=1.0), product of:
              0.08057416 = queryWeight, product of:
                1.0681069 = boost
                4.468454 = idf(docFreq=1331, maxDocs=42740)
                0.016881995 = queryNorm
              0.41891754 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.468454 = idf(docFreq=1331, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.03777135 = weight(abstract_txt:environment in 6627) [ClassicSimilarity], result of:
            0.03777135 = score(doc=6627,freq=1.0), product of:
              0.08684695 = queryWeight, product of:
                1.1089044 = boost
                4.6391315 = idf(docFreq=1122, maxDocs=42740)
                0.016881995 = queryNorm
              0.43491858 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6391315 = idf(docFreq=1122, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.042911638 = weight(abstract_txt:effective in 6627) [ClassicSimilarity], result of:
            0.042911638 = score(doc=6627,freq=1.0), product of:
              0.09455756 = queryWeight, product of:
                1.1570842 = boost
                4.840693 = idf(docFreq=917, maxDocs=42740)
                0.016881995 = queryNorm
              0.45381498 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.840693 = idf(docFreq=917, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.073569275 = weight(abstract_txt:dynamic in 6627) [ClassicSimilarity], result of:
            0.073569275 = score(doc=6627,freq=1.0), product of:
              0.13544944 = queryWeight, product of:
                1.3848586 = boost
                5.7935934 = idf(docFreq=353, maxDocs=42740)
                0.016881995 = queryNorm
              0.54314935 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7935934 = idf(docFreq=353, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.07419199 = weight(abstract_txt:document in 6627) [ClassicSimilarity], result of:
            0.07419199 = score(doc=6627,freq=1.0), product of:
              0.18486905 = queryWeight, product of:
                2.5581083 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.016881995 = queryNorm
              0.40132183 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.29047212 = weight(abstract_txt:incremental in 6627) [ClassicSimilarity], result of:
            0.29047212 = score(doc=6627,freq=1.0), product of:
              0.38732862 = queryWeight, product of:
                2.8681526 = boost
                7.999329 = idf(docFreq=38, maxDocs=42740)
                0.016881995 = queryNorm
              0.7499371 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.999329 = idf(docFreq=38, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.11320969 = weight(abstract_txt:very in 6627) [ClassicSimilarity], result of:
            0.11320969 = score(doc=6627,freq=1.0), product of:
              0.24502775 = queryWeight, product of:
                2.9450622 = boost
                4.928299 = idf(docFreq=840, maxDocs=42740)
                0.016881995 = queryNorm
              0.46202803 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.928299 = idf(docFreq=840, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
          0.32194203 = weight(abstract_txt:clustering in 6627) [ClassicSimilarity], result of:
            0.32194203 = score(doc=6627,freq=2.0), product of:
              0.3903624 = queryWeight, product of:
                3.717242 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.016881995 = queryNorm
              0.824726 = fieldWeight in 6627, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.09375 = fieldNorm(doc=6627)
        0.32 = coord(8/25)
    
  2. Cai, X.; Li, W.: Enhancing sentence-level clustering with integrated and interactive frameworks for theme-based summarization (2011) 0.14
    0.13954537 = sum of:
      0.13954537 = product of:
        0.8721585 = sum of:
          0.08513423 = weight(abstract_txt:discovers in 1771) [ClassicSimilarity], result of:
            0.08513423 = score(doc=1771,freq=1.0), product of:
              0.1552744 = queryWeight, product of:
                1.0484598 = boost
                8.772519 = idf(docFreq=17, maxDocs=42740)
                0.016881995 = queryNorm
              0.54828244 = fieldWeight in 1771, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.772519 = idf(docFreq=17, maxDocs=42740)
                0.0625 = fieldNorm(doc=1771)
          0.07808622 = weight(abstract_txt:datasets in 1771) [ClassicSimilarity], result of:
            0.07808622 = score(doc=1771,freq=1.0), product of:
              0.18468146 = queryWeight, product of:
                1.6170688 = boost
                6.765051 = idf(docFreq=133, maxDocs=42740)
                0.016881995 = queryNorm
              0.42281568 = fieldWeight in 1771, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.765051 = idf(docFreq=133, maxDocs=42740)
                0.0625 = fieldNorm(doc=1771)
          0.12115501 = weight(abstract_txt:document in 1771) [ClassicSimilarity], result of:
            0.12115501 = score(doc=1771,freq=6.0), product of:
              0.18486905 = queryWeight, product of:
                2.5581083 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.016881995 = queryNorm
              0.6553558 = fieldWeight in 1771, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.0625 = fieldNorm(doc=1771)
          0.58778304 = weight(abstract_txt:clustering in 1771) [ClassicSimilarity], result of:
            0.58778304 = score(doc=1771,freq=15.0), product of:
              0.3903624 = queryWeight, product of:
                3.717242 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.016881995 = queryNorm
              1.5057367 = fieldWeight in 1771, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.0625 = fieldNorm(doc=1771)
        0.16 = coord(4/25)
    
  3. Ruocco, A.S.; Frieder, O.: Clustering and classification of large document bases in a parallel environment (1997) 0.11
    0.112775035 = sum of:
      0.112775035 = product of:
        0.704844 = sum of:
          0.033753928 = weight(abstract_txt:large in 2662) [ClassicSimilarity], result of:
            0.033753928 = score(doc=2662,freq=1.0), product of:
              0.08057416 = queryWeight, product of:
                1.0681069 = boost
                4.468454 = idf(docFreq=1331, maxDocs=42740)
                0.016881995 = queryNorm
              0.41891754 = fieldWeight in 2662, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.468454 = idf(docFreq=1331, maxDocs=42740)
                0.09375 = fieldNorm(doc=2662)
          0.08729099 = weight(abstract_txt:corpus in 2662) [ClassicSimilarity], result of:
            0.08729099 = score(doc=2662,freq=1.0), product of:
              0.15180723 = queryWeight, product of:
                1.4660982 = boost
                6.1334615 = idf(docFreq=251, maxDocs=42740)
                0.016881995 = queryNorm
              0.575012 = fieldWeight in 2662, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1334615 = idf(docFreq=251, maxDocs=42740)
                0.09375 = fieldNorm(doc=2662)
          0.12850428 = weight(abstract_txt:document in 2662) [ClassicSimilarity], result of:
            0.12850428 = score(doc=2662,freq=3.0), product of:
              0.18486905 = queryWeight, product of:
                2.5581083 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.016881995 = queryNorm
              0.6951097 = fieldWeight in 2662, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.09375 = fieldNorm(doc=2662)
          0.4552948 = weight(abstract_txt:clustering in 2662) [ClassicSimilarity], result of:
            0.4552948 = score(doc=2662,freq=4.0), product of:
              0.3903624 = queryWeight, product of:
                3.717242 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.016881995 = queryNorm
              1.1663387 = fieldWeight in 2662, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.09375 = fieldNorm(doc=2662)
        0.16 = coord(4/25)
    
  4. Zhan, J.; Loh, H.T.: Using latent semantic indexing to improve the accuracy of document clustering (2007) 0.11
    0.11190743 = sum of:
      0.11190743 = product of:
        0.69942147 = sum of:
          0.028128276 = weight(abstract_txt:large in 2265) [ClassicSimilarity], result of:
            0.028128276 = score(doc=2265,freq=1.0), product of:
              0.08057416 = queryWeight, product of:
                1.0681069 = boost
                4.468454 = idf(docFreq=1331, maxDocs=42740)
                0.016881995 = queryNorm
              0.34909797 = fieldWeight in 2265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.468454 = idf(docFreq=1331, maxDocs=42740)
                0.078125 = fieldNorm(doc=2265)
          0.045724615 = weight(abstract_txt:when in 2265) [ClassicSimilarity], result of:
            0.045724615 = score(doc=2265,freq=1.0), product of:
              0.14034936 = queryWeight, product of:
                1.9935955 = boost
                4.1701303 = idf(docFreq=1794, maxDocs=42740)
                0.016881995 = queryNorm
              0.32579142 = fieldWeight in 2265, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1701303 = idf(docFreq=1794, maxDocs=42740)
                0.078125 = fieldNorm(doc=2265)
          0.1236533 = weight(abstract_txt:document in 2265) [ClassicSimilarity], result of:
            0.1236533 = score(doc=2265,freq=4.0), product of:
              0.18486905 = queryWeight, product of:
                2.5581083 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.016881995 = queryNorm
              0.6688697 = fieldWeight in 2265, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.078125 = fieldNorm(doc=2265)
          0.5019153 = weight(abstract_txt:clustering in 2265) [ClassicSimilarity], result of:
            0.5019153 = score(doc=2265,freq=7.0), product of:
              0.3903624 = queryWeight, product of:
                3.717242 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.016881995 = queryNorm
              1.2857674 = fieldWeight in 2265, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.078125 = fieldNorm(doc=2265)
        0.16 = coord(4/25)
    
  5. Zamir, O.; Etzioni, O.: Grouper : a dynamic clustering interface to Web search results (1999) 0.09
    0.08808418 = sum of:
      0.08808418 = product of:
        0.55052614 = sum of:
          0.035759695 = weight(abstract_txt:effective in 208) [ClassicSimilarity], result of:
            0.035759695 = score(doc=208,freq=1.0), product of:
              0.09455756 = queryWeight, product of:
                1.1570842 = boost
                4.840693 = idf(docFreq=917, maxDocs=42740)
                0.016881995 = queryNorm
              0.37817913 = fieldWeight in 208, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.840693 = idf(docFreq=917, maxDocs=42740)
                0.078125 = fieldNorm(doc=208)
          0.047918096 = weight(abstract_txt:simple in 208) [ClassicSimilarity], result of:
            0.047918096 = score(doc=208,freq=1.0), product of:
              0.114930004 = queryWeight, product of:
                1.2756559 = boost
                5.336741 = idf(docFreq=558, maxDocs=42740)
                0.016881995 = queryNorm
              0.41693288 = fieldWeight in 208, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.336741 = idf(docFreq=558, maxDocs=42740)
                0.078125 = fieldNorm(doc=208)
          0.08743609 = weight(abstract_txt:document in 208) [ClassicSimilarity], result of:
            0.08743609 = score(doc=208,freq=2.0), product of:
              0.18486905 = queryWeight, product of:
                2.5581083 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.016881995 = queryNorm
              0.4729623 = fieldWeight in 208, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.078125 = fieldNorm(doc=208)
          0.3794123 = weight(abstract_txt:clustering in 208) [ClassicSimilarity], result of:
            0.3794123 = score(doc=208,freq=4.0), product of:
              0.3903624 = queryWeight, product of:
                3.717242 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.016881995 = queryNorm
              0.97194886 = fieldWeight in 208, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.078125 = fieldNorm(doc=208)
        0.16 = coord(4/25)
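
The content-match scores above combine several per-term scores. As a rough sketch, assuming each term score is queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm, and that the sum is scaled by coord = matched terms / query terms, the following Python recomputes the score of the first content hit (doc 6627). The boost, docFreq, freq, queryNorm, and fieldNorm values are copied from the explain output; the helper names are hypothetical.

```python
import math

MAX_DOCS = 42740
QUERY_NORM = 0.016881995  # copied from the explain output


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Classic TF-IDF inverse document frequency (assumed form)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, boost, docFreq, freq) for the eight terms matched in doc 6627.
MATCHES = [
    ("large", 1.0681069, 1331, 1.0),
    ("environment", 1.1089044, 1122, 1.0),
    ("effective", 1.1570842, 917, 1.0),
    ("dynamic", 1.3848586, 353, 1.0),
    ("document", 2.5581083, 1606, 1.0),
    ("incremental", 2.8681526, 38, 1.0),
    ("very", 2.9450622, 840, 1.0),
    ("clustering", 3.717242, 230, 2.0),
]


def explain_score(matches, field_norm: float, query_terms: int) -> float:
    """Sum the per-term scores, then scale by the coordination factor."""
    raw = sum(term_score(boost, df, freq, field_norm) for _, boost, df, freq in matches)
    coord = len(matches) / query_terms
    return raw * coord


score_doc_6627 = explain_score(MATCHES, field_norm=0.09375, query_terms=25)
print(round(score_doc_6627, 4))
```

The coord factor explains why hit 1 ranks well above the others: it matches 8 of the 25 query terms, while hits 2 to 5 match only 4 each, even though some of them have higher "clustering" term frequencies.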