Document (#39643)

Author
Giannella, C.
Title
An improved algorithm for unsupervised decomposition of a multi-author document
Source
Journal of the Association for Information Science and Technology. 67(2016) no.2, pp. 400-411
Year
2016
Abstract
This article addresses the problem of unsupervised decomposition of a multi-author text document: identifying the sentences written by each author assuming the number of authors is unknown. An approach, BayesAD, is developed for solving this problem: apply a Bayesian segmentation algorithm, followed by a segment clustering algorithm. Results are presented from an empirical comparison between BayesAD and AK, a modified version of an approach published by Akiva and Koppel in 2013. BayesAD exhibited greater accuracy than AK in all experiments. However, BayesAD has a parameter that needs to be set and which had a nontrivial impact on accuracy. Developing an effective method for eliminating this need would be a fruitful direction for future work. When controlling for topic, the accuracy levels of BayesAD and AK were, in all but one case, worse than a baseline approach wherein one author was assumed to write all sentences in the input text document. Hence, room for improved solutions exists.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23375/abstract.

Similar documents (content)

  1. Aldebei, K.; He, X.; Jia, W.; Yeh, W.: SUDMAD: Sequential and unsupervised decomposition of a multi-author document based on a hidden markov model (2018) 0.41
    0.40517458 = sum of:
      0.40517458 = product of:
        1.0129365 = sum of:
          0.02411065 = weight(abstract_txt:than in 4037) [ClassicSimilarity], result of:
            0.02411065 = score(doc=4037,freq=2.0), product of:
              0.08003662 = queryWeight, product of:
                1.0269127 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.020009585 = queryNorm
              0.30124524 = fieldWeight in 4037, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.07146382 = weight(abstract_txt:room in 4037) [ClassicSimilarity], result of:
            0.07146382 = score(doc=4037,freq=1.0), product of:
              0.16514811 = queryWeight, product of:
                1.043064 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.020009585 = queryNorm
              0.43272567 = fieldWeight in 4037, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.045470882 = weight(abstract_txt:approach in 4037) [ClassicSimilarity], result of:
            0.045470882 = score(doc=4037,freq=4.0), product of:
              0.11100063 = queryWeight, product of:
                1.4811448 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.020009585 = queryNorm
              0.40964526 = fieldWeight in 4037, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.06059654 = weight(abstract_txt:multi in 4037) [ClassicSimilarity], result of:
            0.06059654 = score(doc=4037,freq=1.0), product of:
              0.18640518 = queryWeight, product of:
                1.5671774 = boost
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.020009585 = queryNorm
              0.3250797 = fieldWeight in 4037, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.07653897 = weight(abstract_txt:document in 4037) [ClassicSimilarity], result of:
            0.07653897 = score(doc=4037,freq=5.0), product of:
              0.14581032 = queryWeight, product of:
                1.6975747 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.020009585 = queryNorm
              0.5249215 = fieldWeight in 4037, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.17113087 = weight(abstract_txt:sentences in 4037) [ClassicSimilarity], result of:
            0.17113087 = score(doc=4037,freq=3.0), product of:
              0.25822875 = queryWeight, product of:
                1.8445544 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.020009585 = queryNorm
              0.66271037 = fieldWeight in 4037, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.18047261 = weight(abstract_txt:unsupervised in 4037) [ClassicSimilarity], result of:
            0.18047261 = score(doc=4037,freq=2.0), product of:
              0.30626005 = queryWeight, product of:
                2.008789 = boost
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020009585 = queryNorm
              0.589279 = fieldWeight in 4037, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.20213021 = weight(abstract_txt:decomposition in 4037) [ClassicSimilarity], result of:
            0.20213021 = score(doc=4037,freq=2.0), product of:
              0.33029622 = queryWeight, product of:
                2.086128 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.020009585 = queryNorm
              0.6119665 = fieldWeight in 4037, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.08037062 = weight(abstract_txt:algorithm in 4037) [ClassicSimilarity], result of:
            0.08037062 = score(doc=4037,freq=1.0), product of:
              0.2575855 = queryWeight, product of:
                2.2562928 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.020009585 = queryNorm
              0.31201532 = fieldWeight in 4037, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
          0.100651294 = weight(abstract_txt:author in 4037) [ClassicSimilarity], result of:
            0.100651294 = score(doc=4037,freq=2.0), product of:
              0.2614402 = queryWeight, product of:
                2.6247644 = boost
                4.9778743 = idf(docFreq=827, maxDocs=44218)
                0.020009585 = queryNorm
              0.38498783 = fieldWeight in 4037, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.9778743 = idf(docFreq=827, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4037)
        0.4 = coord(10/25)
    
  2. Koppel, M.; Winter, Y.: Determining if two documents are written by the same author (2014) 0.15
    0.14777651 = sum of:
      0.14777651 = product of:
        0.73888254 = sum of:
          0.08869472 = weight(abstract_txt:problem in 1602) [ClassicSimilarity], result of:
            0.08869472 = score(doc=1602,freq=3.0), product of:
              0.10496171 = queryWeight, product of:
                1.1759926 = boost
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.020009585 = queryNorm
              0.8450198 = fieldWeight in 1602, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.109375 = fieldNorm(doc=1602)
          0.068458535 = weight(abstract_txt:document in 1602) [ClassicSimilarity], result of:
            0.068458535 = score(doc=1602,freq=1.0), product of:
              0.14581032 = queryWeight, product of:
                1.6975747 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.020009585 = queryNorm
              0.46950403 = fieldWeight in 1602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.109375 = fieldNorm(doc=1602)
          0.25522682 = weight(abstract_txt:unsupervised in 1602) [ClassicSimilarity], result of:
            0.25522682 = score(doc=1602,freq=1.0), product of:
              0.30626005 = queryWeight, product of:
                2.008789 = boost
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020009585 = queryNorm
              0.8333664 = fieldWeight in 1602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.109375 = fieldNorm(doc=1602)
          0.18416004 = weight(abstract_txt:accuracy in 1602) [ClassicSimilarity], result of:
            0.18416004 = score(doc=1602,freq=1.0), product of:
              0.28203315 = queryWeight, product of:
                2.3609395 = boost
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.020009585 = queryNorm
              0.65297306 = fieldWeight in 1602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.109375 = fieldNorm(doc=1602)
          0.14234242 = weight(abstract_txt:author in 1602) [ClassicSimilarity], result of:
            0.14234242 = score(doc=1602,freq=1.0), product of:
              0.2614402 = queryWeight, product of:
                2.6247644 = boost
                4.9778743 = idf(docFreq=827, maxDocs=44218)
                0.020009585 = queryNorm
              0.544455 = fieldWeight in 1602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9778743 = idf(docFreq=827, maxDocs=44218)
                0.109375 = fieldNorm(doc=1602)
        0.2 = coord(5/25)
    
  3. Lochbaum, K.E.; Streeter, A.R.: Comparing and combining the effectiveness of latent semantic indexing and the ordinary vector space model for information retrieval (1989) 0.12
    0.12231805 = sum of:
      0.12231805 = product of:
        0.61159027 = sum of:
          0.09176731 = weight(abstract_txt:write in 3458) [ClassicSimilarity], result of:
            0.09176731 = score(doc=3458,freq=1.0), product of:
              0.15381788 = queryWeight, product of:
                1.0066478 = boost
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.020009585 = queryNorm
              0.5965972 = fieldWeight in 3458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.078125 = fieldNorm(doc=3458)
          0.024355434 = weight(abstract_txt:than in 3458) [ClassicSimilarity], result of:
            0.024355434 = score(doc=3458,freq=1.0), product of:
              0.08003662 = queryWeight, product of:
                1.0269127 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.020009585 = queryNorm
              0.30430365 = fieldWeight in 3458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.078125 = fieldNorm(doc=3458)
          0.0751672 = weight(abstract_txt:improved in 3458) [ClassicSimilarity], result of:
            0.0751672 = score(doc=3458,freq=1.0), product of:
              0.16965906 = queryWeight, product of:
                1.4951257 = boost
                5.6710215 = idf(docFreq=413, maxDocs=44218)
                0.020009585 = queryNorm
              0.44304854 = fieldWeight in 3458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6710215 = idf(docFreq=413, maxDocs=44218)
                0.078125 = fieldNorm(doc=3458)
          0.28875747 = weight(abstract_txt:decomposition in 3458) [ClassicSimilarity], result of:
            0.28875747 = score(doc=3458,freq=2.0), product of:
              0.33029622 = queryWeight, product of:
                2.086128 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.020009585 = queryNorm
              0.8742379 = fieldWeight in 3458, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.078125 = fieldNorm(doc=3458)
          0.13154289 = weight(abstract_txt:accuracy in 3458) [ClassicSimilarity], result of:
            0.13154289 = score(doc=3458,freq=1.0), product of:
              0.28203315 = queryWeight, product of:
                2.3609395 = boost
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.020009585 = queryNorm
              0.46640933 = fieldWeight in 3458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9700394 = idf(docFreq=306, maxDocs=44218)
                0.078125 = fieldNorm(doc=3458)
        0.2 = coord(5/25)
    
  4. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.12
    0.12198785 = sum of:
      0.12198785 = product of:
        0.5082827 = sum of:
          0.04875397 = weight(abstract_txt:text in 2119) [ClassicSimilarity], result of:
            0.04875397 = score(doc=2119,freq=5.0), product of:
              0.08626768 = queryWeight, product of:
                1.0661376 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.020009585 = queryNorm
              0.5651476 = fieldWeight in 2119, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.050682697 = weight(abstract_txt:problem in 2119) [ClassicSimilarity], result of:
            0.050682697 = score(doc=2119,freq=3.0), product of:
              0.10496171 = queryWeight, product of:
                1.1759926 = boost
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.020009585 = queryNorm
              0.48286846 = fieldWeight in 2119, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.036746018 = weight(abstract_txt:approach in 2119) [ClassicSimilarity], result of:
            0.036746018 = score(doc=2119,freq=2.0), product of:
              0.11100063 = queryWeight, product of:
                1.4811448 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.020009585 = queryNorm
              0.33104333 = fieldWeight in 2119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.16963498 = weight(abstract_txt:multi in 2119) [ClassicSimilarity], result of:
            0.16963498 = score(doc=2119,freq=6.0), product of:
              0.18640518 = queryWeight, product of:
                1.5671774 = boost
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.020009585 = queryNorm
              0.91003364 = fieldWeight in 2119, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.03911916 = weight(abstract_txt:document in 2119) [ClassicSimilarity], result of:
            0.03911916 = score(doc=2119,freq=1.0), product of:
              0.14581032 = queryWeight, product of:
                1.6975747 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.020009585 = queryNorm
              0.26828802 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
          0.16334589 = weight(abstract_txt:decomposition in 2119) [ClassicSimilarity], result of:
            0.16334589 = score(doc=2119,freq=1.0), product of:
              0.33029622 = queryWeight, product of:
                2.086128 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.020009585 = queryNorm
              0.4945436 = fieldWeight in 2119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.0625 = fieldNorm(doc=2119)
        0.24 = coord(6/25)
    
  5. D'Angelo, C.A.; Giuffrida, C.; Abramo, G.: A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments (2011) 0.12
    0.121949926 = sum of:
      0.121949926 = product of:
        0.5081247 = sum of:
          0.024355434 = weight(abstract_txt:than in 4190) [ClassicSimilarity], result of:
            0.024355434 = score(doc=4190,freq=1.0), product of:
              0.08003662 = queryWeight, product of:
                1.0269127 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.020009585 = queryNorm
              0.30430365 = fieldWeight in 4190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.078125 = fieldNorm(doc=4190)
          0.03657709 = weight(abstract_txt:problem in 4190) [ClassicSimilarity], result of:
            0.03657709 = score(doc=4190,freq=1.0), product of:
              0.10496171 = queryWeight, product of:
                1.1759926 = boost
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.020009585 = queryNorm
              0.3484803 = fieldWeight in 4190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.460548 = idf(docFreq=1388, maxDocs=44218)
                0.078125 = fieldNorm(doc=4190)
          0.045932524 = weight(abstract_txt:approach in 4190) [ClassicSimilarity], result of:
            0.045932524 = score(doc=4190,freq=2.0), product of:
              0.11100063 = queryWeight, product of:
                1.4811448 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.020009585 = queryNorm
              0.41380417 = fieldWeight in 4190, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.078125 = fieldNorm(doc=4190)
          0.0751672 = weight(abstract_txt:improved in 4190) [ClassicSimilarity], result of:
            0.0751672 = score(doc=4190,freq=1.0), product of:
              0.16965906 = queryWeight, product of:
                1.4951257 = boost
                5.6710215 = idf(docFreq=413, maxDocs=44218)
                0.020009585 = queryNorm
              0.44304854 = fieldWeight in 4190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6710215 = idf(docFreq=413, maxDocs=44218)
                0.078125 = fieldNorm(doc=4190)
          0.18230487 = weight(abstract_txt:unsupervised in 4190) [ClassicSimilarity], result of:
            0.18230487 = score(doc=4190,freq=1.0), product of:
              0.30626005 = queryWeight, product of:
                2.008789 = boost
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020009585 = queryNorm
              0.5952617 = fieldWeight in 4190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.078125 = fieldNorm(doc=4190)
          0.14378756 = weight(abstract_txt:author in 4190) [ClassicSimilarity], result of:
            0.14378756 = score(doc=4190,freq=2.0), product of:
              0.2614402 = queryWeight, product of:
                2.6247644 = boost
                4.9778743 = idf(docFreq=827, maxDocs=44218)
                0.020009585 = queryNorm
              0.5499826 = fieldWeight in 4190, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.9778743 = idf(docFreq=827, maxDocs=44218)
                0.078125 = fieldNorm(doc=4190)
        0.24 = coord(6/25)
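
Note on reading the score breakdowns: the trees above follow the classic Lucene TF-IDF scheme (ClassicSimilarity). Each term score is queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), with tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the document score is the sum of the term scores times the coord factor (matched query terms / total query terms). The Python sketch below recomputes one entry from the listing as a check. The function names are hypothetical and chosen for illustration only; boost, queryNorm and fieldNorm are taken as given from the listing rather than derived, and Lucene itself works in 32-bit floats, so the last digits may differ slightly.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as shown in the listing: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term-frequency factor as shown in the listing: sqrt(termFreq)."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm      # depends on the query only
    field_weight = tf(freq) * term_idf * field_norm   # depends on the document field
    return query_weight * field_weight


def document_score(term_scores, matched_terms, total_terms):
    """Sum of the term scores scaled by coord = matched_terms / total_terms."""
    return sum(term_scores) * (matched_terms / total_terms)


# Values copied from the entry "abstract_txt:room in 4037" of the first hit.
room = term_score(freq=1.0, doc_freq=43, max_docs=44218,
                  boost=1.043064, query_norm=0.020009585, field_norm=0.0546875)
print(round(room, 5))

# The ten term scores listed for the first hit, combined with coord(10/25).
first_hit_terms = [0.02411065, 0.07146382, 0.045470882, 0.06059654, 0.07653897,
                   0.17113087, 0.18047261, 0.20213021, 0.08037062, 0.100651294]
print(round(document_score(first_hit_terms, 10, 25), 5))
```

Under these assumptions the two printed values should agree with the listing (about 0.0715 for the term and about 0.405 for the first hit) up to floating-point rounding. The coord factor explains why the first hit, which matches 10 of the 25 query terms, ranks well ahead of the others, which match only 5 or 6.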