Document (#30291)

Author
Wu, Y.-f.B.
Li, Q.
Bot, R.S.
Chen, X.
Title
Finding nuggets in documents : a machine learning approach
Source
Journal of the American Society for Information Science and Technology. 57(2006) no.6, pp. 740-752
Year
2006
Abstract
Document keyphrases provide a concise summary of a document's content and serve as semantic metadata for it. They can be used in many applications related to knowledge management and text mining, such as automatic text summarization, development of search engines, document clustering, document classification, thesaurus construction, and browsing interfaces. Because only a small portion of documents have keyphrases assigned by authors, and it is time-consuming and costly to manually assign keyphrases to documents, it is necessary to develop an algorithm to automatically generate keyphrases for documents. This paper describes a Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human-identified phrases to assign weights to the candidate keyphrases. The logic of our algorithm is: the more keywords a candidate keyphrase contains and the more significant these keywords are, the more likely this candidate phrase is a keyphrase. KIP's learning function can enrich the glossary database by automatically adding newly identified keyphrases to the database. KIP's personalization feature lets the user build a glossary database specifically suited to his/her area of interest. The evaluation results show that KIP performs better than the systems it was compared with and that the learning function is effective.
Theme
Automatic abstracting

Similar documents (author)

  1. Chen, Y.N.; Chen, S.J.: ¬A metadata practice of the IFLA FRBR model : a case study for the National Palace Museum in Taipei (2004) 4.35
    4.3499155 = sum of:
      4.3499155 = weight(author_txt:chen in 3384) [ClassicSimilarity], result of:
        4.3499155 = fieldWeight in 3384, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          6.1517096 = idf(docFreq=255, maxDocs=44218)
          0.5 = fieldNorm(doc=3384)
    
  2. Chen, C.C.; Chen, H.H.; Chen, K.H.: ¬The design of the XML/Metadata management system (2000) 4.00
    3.9956524 = sum of:
      3.9956524 = weight(author_txt:chen in 4633) [ClassicSimilarity], result of:
        3.9956524 = fieldWeight in 4633, product of:
          1.7320508 = tf(freq=3.0), with freq of:
            3.0 = termFreq=3.0
          6.1517096 = idf(docFreq=255, maxDocs=44218)
          0.375 = fieldNorm(doc=4633)
    
  3. Chen, W.Y.: Observations on cataloguing and classification (1991) 3.84
    3.8448186 = sum of:
      3.8448186 = weight(author_txt:chen in 4184) [ClassicSimilarity], result of:
        3.8448186 = fieldWeight in 4184, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.1517096 = idf(docFreq=255, maxDocs=44218)
          0.625 = fieldNorm(doc=4184)
    
  4. Chen, H.: Knowledge-based document retrieval : framework and design (1992) 3.84
    3.8448186 = sum of:
      3.8448186 = weight(author_txt:chen in 5283) [ClassicSimilarity], result of:
        3.8448186 = fieldWeight in 5283, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.1517096 = idf(docFreq=255, maxDocs=44218)
          0.625 = fieldNorm(doc=5283)
    
  5. Chen, P.S.: On inference rules of logic-based information retrieval systems (1994) 3.84
    3.8448186 = sum of:
      3.8448186 = weight(author_txt:chen in 6731) [ClassicSimilarity], result of:
        3.8448186 = fieldWeight in 6731, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.1517096 = idf(docFreq=255, maxDocs=44218)
          0.625 = fieldNorm(doc=6731)
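
The author scores above can be reproduced by hand. The following is a minimal sketch, assuming the usual ClassicSimilarity formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are illustrative and not part of any library, and fieldNorm is simply taken from the listing rather than derived.

```python
import math

def classic_tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq, max_docs):
    # Inverse document frequency as reported in the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm (fieldNorm copied from the listing).
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Entry 1 (doc 3384): author_txt:chen, freq=2.0, fieldNorm=0.5
score_3384 = field_weight(freq=2.0, doc_freq=255, max_docs=44218, field_norm=0.5)
# Entry 2 (doc 4633): author_txt:chen, freq=3.0, fieldNorm=0.375
score_4633 = field_weight(freq=3.0, doc_freq=255, max_docs=44218, field_norm=0.375)

print(round(score_3384, 4), round(score_4633, 4))
```

The results should agree with the listed 4.3499155 and 3.9956524 up to rounding, since the listing was computed in single precision.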
    

Similar documents (content)

  1. Jones, S.; Paynter, G.W.: Automatic extraction of document keyphrases for use in digital libraries : evaluations and applications (2002) 0.92
    0.91753983 = sum of:
      0.91753983 = product of:
        1.9115413 = sum of:
          0.031507894 = weight(abstract_txt:consuming in 601) [ClassicSimilarity], result of:
            0.031507894 = score(doc=601,freq=1.0), product of:
              0.06949406 = queryWeight, product of:
                7.2542357 = idf(docFreq=84, maxDocs=44218)
                0.009579791 = queryNorm
              0.45338973 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2542357 = idf(docFreq=84, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.038354553 = weight(abstract_txt:costly in 601) [ClassicSimilarity], result of:
            0.038354553 = score(doc=601,freq=1.0), product of:
              0.079228126 = queryWeight, product of:
                1.0677408 = boost
                7.7456436 = idf(docFreq=51, maxDocs=44218)
                0.009579791 = queryNorm
              0.48410273 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7456436 = idf(docFreq=51, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.04054269 = weight(abstract_txt:concise in 601) [ClassicSimilarity], result of:
            0.04054269 = score(doc=601,freq=1.0), product of:
              0.0822135 = queryWeight, product of:
                1.0876714 = boost
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.009579791 = queryNorm
              0.49313906 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.027717417 = weight(abstract_txt:automatically in 601) [ClassicSimilarity], result of:
            0.027717417 = score(doc=601,freq=1.0), product of:
              0.080385916 = queryWeight, product of:
                1.5210067 = boost
                5.5168705 = idf(docFreq=482, maxDocs=44218)
                0.009579791 = queryNorm
              0.3448044 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5168705 = idf(docFreq=482, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.028627302 = weight(abstract_txt:function in 601) [ClassicSimilarity], result of:
            0.028627302 = score(doc=601,freq=1.0), product of:
              0.082135655 = queryWeight, product of:
                1.5374713 = boost
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.009579791 = queryNorm
              0.34853685 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.030657584 = weight(abstract_txt:algorithm in 601) [ClassicSimilarity], result of:
            0.030657584 = score(doc=601,freq=1.0), product of:
              0.08597458 = queryWeight, product of:
                1.5729908 = boost
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.009579791 = queryNorm
              0.35658893 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.705423 = idf(docFreq=399, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.0358934 = weight(abstract_txt:keywords in 601) [ClassicSimilarity], result of:
            0.0358934 = score(doc=601,freq=1.0), product of:
              0.09550391 = queryWeight, product of:
                1.6578748 = boost
                6.0133076 = idf(docFreq=293, maxDocs=44218)
                0.009579791 = queryNorm
              0.37583172 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0133076 = idf(docFreq=293, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.06460893 = weight(abstract_txt:assign in 601) [ClassicSimilarity], result of:
            0.06460893 = score(doc=601,freq=1.0), product of:
              0.14132093 = queryWeight, product of:
                2.0167143 = boost
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.009579791 = queryNorm
              0.4571788 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.314861 = idf(docFreq=79, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.023110423 = weight(abstract_txt:documents in 601) [ClassicSimilarity], result of:
            0.023110423 = score(doc=601,freq=1.0), product of:
              0.08972085 = queryWeight, product of:
                2.2724946 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.009579791 = queryNorm
              0.2575814 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.05653778 = weight(abstract_txt:document in 601) [ClassicSimilarity], result of:
            0.05653778 = score(doc=601,freq=3.0), product of:
              0.12166814 = queryWeight, product of:
                2.958691 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.009579791 = queryNorm
              0.46468848 = fieldWeight in 601, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.26666716 = weight(abstract_txt:keyphrase in 601) [ClassicSimilarity], result of:
            0.26666716 = score(doc=601,freq=2.0), product of:
              0.33037832 = queryWeight, product of:
                3.7765265 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.009579791 = queryNorm
              0.8071569 = fieldWeight in 601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          1.2673162 = weight(abstract_txt:keyphrases in 601) [ClassicSimilarity], result of:
            1.2673162 = score(doc=601,freq=7.0), product of:
              0.81581455 = queryWeight, product of:
                9.065064 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.009579791 = queryNorm
              1.5534366 = fieldWeight in 601, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
        0.48 = coord(12/25)
    
  2. Jiang, Y.; Meng, R.; Huang, Y.; Lu, W.; Liu, J.: Generating keyphrases for readers : a controllable keyphrase generation framework (2023) 0.29
    0.29151827 = sum of:
      0.29151827 = product of:
        1.4575913 = sum of:
          0.010916034 = weight(abstract_txt:text in 1012) [ClassicSimilarity], result of:
            0.010916034 = score(doc=1012,freq=1.0), product of:
              0.04319048 = queryWeight, product of:
                1.1148981 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.009579791 = queryNorm
              0.25274166 = fieldWeight in 1012, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.040485118 = weight(abstract_txt:function in 1012) [ClassicSimilarity], result of:
            0.040485118 = score(doc=1012,freq=2.0), product of:
              0.082135655 = queryWeight, product of:
                1.5374713 = boost
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.009579791 = queryNorm
              0.49290553 = fieldWeight in 1012, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5765896 = idf(docFreq=454, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.026551306 = weight(abstract_txt:learning in 1012) [ClassicSimilarity], result of:
            0.026551306 = score(doc=1012,freq=1.0), product of:
              0.089419544 = queryWeight, product of:
                1.9647306 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.009579791 = queryNorm
              0.29692957 = fieldWeight in 1012, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.42163777 = weight(abstract_txt:keyphrase in 1012) [ClassicSimilarity], result of:
            0.42163777 = score(doc=1012,freq=5.0), product of:
              0.33037832 = queryWeight, product of:
                3.7765265 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.009579791 = queryNorm
              1.2762271 = fieldWeight in 1012, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.9580011 = weight(abstract_txt:keyphrases in 1012) [ClassicSimilarity], result of:
            0.9580011 = score(doc=1012,freq=4.0), product of:
              0.81581455 = queryWeight, product of:
                9.065064 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.009579791 = queryNorm
              1.1742878 = fieldWeight in 1012, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
        0.2 = coord(5/25)
    
  3. Pirkola, A.: Constructing topic-specific search keyphrase suggestion tools for Web information retrieval (2010) 0.24
    0.23826534 = sum of:
      0.23826534 = product of:
        1.4891584 = sum of:
          0.019297004 = weight(abstract_txt:text in 4665) [ClassicSimilarity], result of:
            0.019297004 = score(doc=4665,freq=2.0), product of:
              0.04319048 = queryWeight, product of:
                1.1148981 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.009579791 = queryNorm
              0.44678837 = fieldWeight in 4665, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
          0.02454579 = weight(abstract_txt:identified in 4665) [ClassicSimilarity], result of:
            0.02454579 = score(doc=4665,freq=1.0), product of:
              0.06388361 = queryWeight, product of:
                1.3559257 = boost
                4.9181023 = idf(docFreq=878, maxDocs=44218)
                0.009579791 = queryNorm
              0.38422674 = fieldWeight in 4665, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9181023 = idf(docFreq=878, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
          0.40824902 = weight(abstract_txt:keyphrase in 4665) [ClassicSimilarity], result of:
            0.40824902 = score(doc=4665,freq=3.0), product of:
              0.33037832 = queryWeight, product of:
                3.7765265 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.009579791 = queryNorm
              1.2357016 = fieldWeight in 4665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
          1.0370666 = weight(abstract_txt:keyphrases in 4665) [ClassicSimilarity], result of:
            1.0370666 = score(doc=4665,freq=3.0), product of:
              0.81581455 = queryWeight, product of:
                9.065064 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.009579791 = queryNorm
              1.2712038 = fieldWeight in 4665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
        0.16 = coord(4/25)
    
  4. Daudaravicius, V.: ¬A framework for keyphrase extraction from scientific journals (2016) 0.23
    0.22666788 = sum of:
      0.22666788 = product of:
        1.4166743 = sum of:
          0.04157613 = weight(abstract_txt:automatically in 2930) [ClassicSimilarity], result of:
            0.04157613 = score(doc=2930,freq=1.0), product of:
              0.080385916 = queryWeight, product of:
                1.5210067 = boost
                5.5168705 = idf(docFreq=482, maxDocs=44218)
                0.009579791 = queryNorm
              0.5172066 = fieldWeight in 2930, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5168705 = idf(docFreq=482, maxDocs=44218)
                0.09375 = fieldNorm(doc=2930)
          0.076141395 = weight(abstract_txt:keywords in 2930) [ClassicSimilarity], result of:
            0.076141395 = score(doc=2930,freq=2.0), product of:
              0.09550391 = queryWeight, product of:
                1.6578748 = boost
                6.0133076 = idf(docFreq=293, maxDocs=44218)
                0.009579791 = queryNorm
              0.79725945 = fieldWeight in 2930, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0133076 = idf(docFreq=293, maxDocs=44218)
                0.09375 = fieldNorm(doc=2930)
          0.2828432 = weight(abstract_txt:keyphrase in 2930) [ClassicSimilarity], result of:
            0.2828432 = score(doc=2930,freq=1.0), product of:
              0.33037832 = queryWeight, product of:
                3.7765265 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.009579791 = queryNorm
              0.85611916 = fieldWeight in 2930, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.09375 = fieldNorm(doc=2930)
          1.0161135 = weight(abstract_txt:keyphrases in 2930) [ClassicSimilarity], result of:
            1.0161135 = score(doc=2930,freq=2.0), product of:
              0.81581455 = queryWeight, product of:
                9.065064 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.009579791 = queryNorm
              1.2455202 = fieldWeight in 2930, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.09375 = fieldNorm(doc=2930)
        0.16 = coord(4/25)
    
  5. Martín-Moncunill, D.; García-Barriocanal, E.; Sicilia, M.-A.; Sánchez-Alonso, S.: Evaluating the practical applicability of thesaurus-based keyphrase extraction in the agricultural domain : insights from the VOA3R project (2015) 0.21
    0.21020195 = sum of:
      0.21020195 = product of:
        1.3137622 = sum of:
          0.009749946 = weight(abstract_txt:more in 2106) [ClassicSimilarity], result of:
            0.009749946 = score(doc=2106,freq=1.0), product of:
              0.045853943 = queryWeight, product of:
                1.4069386 = boost
                3.402088 = idf(docFreq=4002, maxDocs=44218)
                0.009579791 = queryNorm
              0.2126305 = fieldWeight in 2106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.402088 = idf(docFreq=4002, maxDocs=44218)
                0.0625 = fieldNorm(doc=2106)
          0.019411938 = weight(abstract_txt:database in 2106) [ClassicSimilarity], result of:
            0.019411938 = score(doc=2106,freq=1.0), product of:
              0.07256956 = queryWeight, product of:
                1.7699623 = boost
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.009579791 = queryNorm
              0.26749423 = fieldWeight in 2106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.0625 = fieldNorm(doc=2106)
          0.3265992 = weight(abstract_txt:keyphrase in 2106) [ClassicSimilarity], result of:
            0.3265992 = score(doc=2106,freq=3.0), product of:
              0.33037832 = queryWeight, product of:
                3.7765265 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.009579791 = queryNorm
              0.9885613 = fieldWeight in 2106, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=2106)
          0.9580011 = weight(abstract_txt:keyphrases in 2106) [ClassicSimilarity], result of:
            0.9580011 = score(doc=2106,freq=4.0), product of:
              0.81581455 = queryWeight, product of:
                9.065064 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.009579791 = queryNorm
              1.1742878 = fieldWeight in 2106, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=2106)
        0.16 = coord(4/25)
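
The content scores combine several per-term scores. As a hedged sketch of how the trees above fit together, the snippet below recomputes entry 4 (doc 2930), assuming each term score is queryWeight * fieldWeight with queryWeight = boost * idf * queryNorm, and that the sum is scaled by coord = matched terms / query terms. The helper names are hypothetical; boost, queryNorm, and fieldNorm are copied from the listing.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.009579791   # copied from the listing
FIELD_NORM = 0.09375       # fieldNorm(doc=2930), copied from the listing

def classic_idf(doc_freq, max_docs=MAX_DOCS):
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def term_score(freq, doc_freq, boost):
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM          # query side
    field_weight = math.sqrt(freq) * idf * FIELD_NORM  # document side
    return query_weight * field_weight

# (term, freq in doc 2930, docFreq, boost) as shown in the explain tree
matched_terms = [
    ("automatically", 1.0, 482, 1.5210067),
    ("keywords",      2.0, 293, 1.6578748),
    ("keyphrase",     1.0, 12,  3.7765265),
    ("keyphrases",    2.0, 9,   9.065064),
]

raw_sum = sum(term_score(f, df, b) for _, f, df, b in matched_terms)
coord = len(matched_terms) / 25   # 4 of the 25 query terms matched
total = raw_sum * coord

print(round(raw_sum, 4), round(total, 4))
```

Under these assumptions the total should land close to the listed 0.22666788. The coord factor explains why entry 1, which matches 12 of the 25 query terms, scores so much higher than the others even though its raw sum is of similar size.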