Document (#33872)

Author
Medelyan, O.
Witten, I.H.
Title
Domain-independent automatic keyphrase indexing with small training sets
Source
Journal of the American Society for Information Science and Technology. 59(2008) no.7, pp. 1026-1040
Year
2008
Abstract
Keyphrases are widely used in both physical and digital libraries as a brief, but precise, summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloging rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and of French and Spanish documents.
Theme
Automatic indexing

Similar documents (author)

  1. Witten, I.H.; Frank, E.: Data Mining : Praktische Werkzeuge und Techniken für das maschinelle Lernen (2000) 4.61
    4.6059904 = sum of:
      4.6059904 = weight(author_txt:witten in 6833) [ClassicSimilarity], result of:
        4.6059904 = fieldWeight in 6833, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.211981 = idf(docFreq=11, maxDocs=44218)
          0.5 = fieldNorm(doc=6833)
    
  2. Witten, I.H.; Bainbridge, D.: Creating digital library collections with Greenstone (2005) 4.61
    4.6059904 = sum of:
      4.6059904 = weight(author_txt:witten in 2578) [ClassicSimilarity], result of:
        4.6059904 = fieldWeight in 2578, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.211981 = idf(docFreq=11, maxDocs=44218)
          0.5 = fieldNorm(doc=2578)
    
  3. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.45
    3.4544928 = sum of:
      3.4544928 = weight(author_txt:witten in 3083) [ClassicSimilarity], result of:
        3.4544928 = fieldWeight in 3083, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.211981 = idf(docFreq=11, maxDocs=44218)
          0.375 = fieldNorm(doc=3083)
    
  4. Bainbridge, D.; Dewsnip, M.; Witten, I.H.: Searching digital music libraries (2005) 3.45
    3.4544928 = sum of:
      3.4544928 = weight(author_txt:witten in 997) [ClassicSimilarity], result of:
        3.4544928 = fieldWeight in 997, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.211981 = idf(docFreq=11, maxDocs=44218)
          0.375 = fieldNorm(doc=997)
    
  5. Witten, I.H.; Bainbridge, D.; Boddie, S.J.: Greenstone : open-source digital library software (2001) 3.45
    3.4544928 = sum of:
      3.4544928 = weight(author_txt:witten in 1225) [ClassicSimilarity], result of:
        3.4544928 = fieldWeight in 1225, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.211981 = idf(docFreq=11, maxDocs=44218)
          0.375 = fieldNorm(doc=1225)
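
The author scores above can be reproduced by hand. The sketch below assumes the usual ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the tf and idf values printed in the explanations. The function names are illustrative and not part of any library.

```python
import math

def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw term frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency as printed in the explanations above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Values copied from the explanations: author_txt:witten, docFreq=11, maxDocs=44218.
top_hit = field_weight(1.0, 11, 44218, 0.5)      # results 1 and 2
lower_hit = field_weight(1.0, 11, 44218, 0.375)  # results 3 to 5

print(round(top_hit, 2), round(lower_hit, 2))  # 4.61 3.45
```

Only fieldNorm differs between the two groups of results (0.5 versus 0.375), so the ranking among the author matches is decided by field length normalization alone.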
    

Similar documents (content)

  1. Jones, S.; Paynter, G.W.: Automatic extraction of document keyphrases for use in digital libraries : evaluations and applications (2002) 0.38
    0.38073152 = sum of:
      0.38073152 = product of:
        1.1897861 = sum of:
          0.05636736 = weight(abstract_txt:manually in 601) [ClassicSimilarity], result of:
            0.05636736 = score(doc=601,freq=1.0), product of:
              0.13568129 = queryWeight, product of:
                1.0297269 = boost
                6.6470313 = idf(docFreq=155, maxDocs=44218)
                0.019823037 = queryNorm
              0.41543946 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6470313 = idf(docFreq=155, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.05669626 = weight(abstract_txt:descriptors in 601) [ClassicSimilarity], result of:
            0.05669626 = score(doc=601,freq=1.0), product of:
              0.13620856 = queryWeight, product of:
                1.0317258 = boost
                6.6599345 = idf(docFreq=153, maxDocs=44218)
                0.019823037 = queryNorm
              0.4162459 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6599345 = idf(docFreq=153, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.4210049 = weight(abstract_txt:keyphrases in 601) [ClassicSimilarity], result of:
            0.4210049 = score(doc=601,freq=7.0), product of:
              0.27101517 = queryWeight, product of:
                1.4553212 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.019823037 = queryNorm
              1.5534366 = fieldWeight in 601, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.051282775 = weight(abstract_txt:training in 601) [ClassicSimilarity], result of:
            0.051282775 = score(doc=601,freq=1.0), product of:
              0.16050646 = queryWeight, product of:
                1.5838838 = boost
                5.112096 = idf(docFreq=723, maxDocs=44218)
                0.019823037 = queryNorm
              0.319506 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.112096 = idf(docFreq=723, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.07613705 = weight(abstract_txt:automatic in 601) [ClassicSimilarity], result of:
            0.07613705 = score(doc=601,freq=2.0), product of:
              0.16579275 = queryWeight, product of:
                1.6097552 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.019823037 = queryNorm
              0.45923027 = fieldWeight in 601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.061148874 = weight(abstract_txt:domain in 601) [ClassicSimilarity], result of:
            0.061148874 = score(doc=601,freq=1.0), product of:
              0.20660186 = queryWeight, product of:
                2.2008467 = boost
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.019823037 = queryNorm
              0.29597446 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.05374128 = weight(abstract_txt:documents in 601) [ClassicSimilarity], result of:
            0.05374128 = score(doc=601,freq=1.0), product of:
              0.20863804 = queryWeight, product of:
                2.5538113 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.019823037 = queryNorm
              0.2575814 = fieldWeight in 601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
          0.41340753 = weight(abstract_txt:keyphrase in 601) [ClassicSimilarity], result of:
            0.41340753 = score(doc=601,freq=2.0), product of:
              0.5121774 = queryWeight, product of:
                2.8293538 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.019823037 = queryNorm
              0.8071569 = fieldWeight in 601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=601)
        0.32 = coord(8/25)
    
  2. Wu, Y.-f.B.; Li, Q.; Bot, R.S.; Chen, X.: Finding nuggets in documents : a machine learning approach (2006) 0.35
    0.34854794 = sum of:
      0.34854794 = product of:
        1.244814 = sum of:
          0.05201002 = weight(abstract_txt:summary in 5290) [ClassicSimilarity], result of:
            0.05201002 = score(doc=5290,freq=1.0), product of:
              0.12859562 = queryWeight, product of:
                1.0024787 = boost
                6.4711404 = idf(docFreq=185, maxDocs=44218)
                0.019823037 = queryNorm
              0.40444627 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.4711404 = idf(docFreq=185, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.05636736 = weight(abstract_txt:manually in 5290) [ClassicSimilarity], result of:
            0.05636736 = score(doc=5290,freq=1.0), product of:
              0.13568129 = queryWeight, product of:
                1.0297269 = boost
                6.6470313 = idf(docFreq=155, maxDocs=44218)
                0.019823037 = queryNorm
              0.41543946 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6470313 = idf(docFreq=155, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.06219339 = weight(abstract_txt:phrases in 5290) [ClassicSimilarity], result of:
            0.06219339 = score(doc=5290,freq=1.0), product of:
              0.1448764 = queryWeight, product of:
                1.0640472 = boost
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.019823037 = queryNorm
              0.42928585 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.4210049 = weight(abstract_txt:keyphrases in 5290) [ClassicSimilarity], result of:
            0.4210049 = score(doc=5290,freq=7.0), product of:
              0.27101517 = queryWeight, product of:
                1.4553212 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.019823037 = queryNorm
              1.5534366 = fieldWeight in 5290, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.053837027 = weight(abstract_txt:automatic in 5290) [ClassicSimilarity], result of:
            0.053837027 = score(doc=5290,freq=1.0), product of:
              0.16579275 = queryWeight, product of:
                1.6097552 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.019823037 = queryNorm
              0.32472485 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.09308263 = weight(abstract_txt:documents in 5290) [ClassicSimilarity], result of:
            0.09308263 = score(doc=5290,freq=3.0), product of:
              0.20863804 = queryWeight, product of:
                2.5538113 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.019823037 = queryNorm
              0.44614407 = fieldWeight in 5290, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.50631875 = weight(abstract_txt:keyphrase in 5290) [ClassicSimilarity], result of:
            0.50631875 = score(doc=5290,freq=3.0), product of:
              0.5121774 = queryWeight, product of:
                2.8293538 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.019823037 = queryNorm
              0.9885613 = fieldWeight in 5290, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
        0.28 = coord(7/25)
    
  3. Jiang, Y.; Meng, R.; Huang, Y.; Lu, W.; Liu, J.: Generating keyphrases for readers : a controllable keyphrase generation framework (2023) 0.28
    0.2815362 = sum of:
      0.2815362 = product of:
        1.1730675 = sum of:
          0.06219339 = weight(abstract_txt:phrases in 1012) [ClassicSimilarity], result of:
            0.06219339 = score(doc=1012,freq=1.0), product of:
              0.1448764 = queryWeight, product of:
                1.0640472 = boost
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.019823037 = queryNorm
              0.42928585 = fieldWeight in 1012, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.3182498 = weight(abstract_txt:keyphrases in 1012) [ClassicSimilarity], result of:
            0.3182498 = score(doc=1012,freq=4.0), product of:
              0.27101517 = queryWeight, product of:
                1.4553212 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.019823037 = queryNorm
              1.1742878 = fieldWeight in 1012, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.023983669 = weight(abstract_txt:with in 1012) [ClassicSimilarity], result of:
            0.023983669 = score(doc=1012,freq=4.0), product of:
              0.076755926 = queryWeight, product of:
                1.5489879 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.019823037 = queryNorm
              0.31246668 = fieldWeight in 1012, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.053837027 = weight(abstract_txt:automatic in 1012) [ClassicSimilarity], result of:
            0.053837027 = score(doc=1012,freq=1.0), product of:
              0.16579275 = queryWeight, product of:
                1.6097552 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.019823037 = queryNorm
              0.32472485 = fieldWeight in 1012, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.061148874 = weight(abstract_txt:domain in 1012) [ClassicSimilarity], result of:
            0.061148874 = score(doc=1012,freq=1.0), product of:
              0.20660186 = queryWeight, product of:
                2.2008467 = boost
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.019823037 = queryNorm
              0.29597446 = fieldWeight in 1012, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7355914 = idf(docFreq=1054, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
          0.6536547 = weight(abstract_txt:keyphrase in 1012) [ClassicSimilarity], result of:
            0.6536547 = score(doc=1012,freq=5.0), product of:
              0.5121774 = queryWeight, product of:
                2.8293538 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.019823037 = queryNorm
              1.2762271 = fieldWeight in 1012, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=1012)
        0.24 = coord(6/25)
    
  4. Zhang, Y.; Zhang, C.; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction (2020) 0.19
    0.19211726 = sum of:
      0.19211726 = product of:
        0.9605863 = sum of:
          0.06760158 = weight(abstract_txt:trained in 5816) [ClassicSimilarity], result of:
            0.06760158 = score(doc=5816,freq=1.0), product of:
              0.15315789 = queryWeight, product of:
                1.0940363 = boost
                7.062158 = idf(docFreq=102, maxDocs=44218)
                0.019823037 = queryNorm
              0.44138488 = fieldWeight in 5816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.062158 = idf(docFreq=102, maxDocs=44218)
                0.0625 = fieldNorm(doc=5816)
          0.011991834 = weight(abstract_txt:with in 5816) [ClassicSimilarity], result of:
            0.011991834 = score(doc=5816,freq=1.0), product of:
              0.076755926 = queryWeight, product of:
                1.5489879 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.019823037 = queryNorm
              0.15623334 = fieldWeight in 5816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=5816)
          0.053837027 = weight(abstract_txt:automatic in 5816) [ClassicSimilarity], result of:
            0.053837027 = score(doc=5816,freq=1.0), product of:
              0.16579275 = queryWeight, product of:
                1.6097552 = boost
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.019823037 = queryNorm
              0.32472485 = fieldWeight in 5816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1955976 = idf(docFreq=665, maxDocs=44218)
                0.0625 = fieldNorm(doc=5816)
          0.05374128 = weight(abstract_txt:documents in 5816) [ClassicSimilarity], result of:
            0.05374128 = score(doc=5816,freq=1.0), product of:
              0.20863804 = queryWeight, product of:
                2.5538113 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.019823037 = queryNorm
              0.2575814 = fieldWeight in 5816, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=5816)
          0.7734146 = weight(abstract_txt:keyphrase in 5816) [ClassicSimilarity], result of:
            0.7734146 = score(doc=5816,freq=7.0), product of:
              0.5121774 = queryWeight, product of:
                2.8293538 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.019823037 = queryNorm
              1.5100522 = fieldWeight in 5816, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=5816)
        0.2 = coord(5/25)
    
  5. Pirkola, A.: Constructing topic-specific search keyphrase suggestion tools for Web information retrieval (2010) 0.18
    0.18032901 = sum of:
      0.18032901 = product of:
        1.1270564 = sum of:
          0.13465263 = weight(abstract_txt:phrases in 4665) [ClassicSimilarity], result of:
            0.13465263 = score(doc=4665,freq=3.0), product of:
              0.1448764 = queryWeight, product of:
                1.0640472 = boost
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.019823037 = queryNorm
              0.9294311 = fieldWeight in 4665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8685737 = idf(docFreq=124, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
          0.3445155 = weight(abstract_txt:keyphrases in 4665) [ClassicSimilarity], result of:
            0.3445155 = score(doc=4665,freq=3.0), product of:
              0.27101517 = queryWeight, product of:
                1.4553212 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.019823037 = queryNorm
              1.2712038 = fieldWeight in 4665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
          0.014989792 = weight(abstract_txt:with in 4665) [ClassicSimilarity], result of:
            0.014989792 = score(doc=4665,freq=1.0), product of:
              0.076755926 = queryWeight, product of:
                1.5489879 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.019823037 = queryNorm
              0.19529167 = fieldWeight in 4665, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
          0.63289845 = weight(abstract_txt:keyphrase in 4665) [ClassicSimilarity], result of:
            0.63289845 = score(doc=4665,freq=3.0), product of:
              0.5121774 = queryWeight, product of:
                2.8293538 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.019823037 = queryNorm
              1.2357016 = fieldWeight in 4665, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.078125 = fieldNorm(doc=4665)
        0.16 = coord(4/25)
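
The content scores above follow one pattern: each matching term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by coord, the fraction of query terms that matched. The sketch below recomputes the score of the first content result (doc 601) under the same assumed tf and idf definitions as the printed values. The helper names are illustrative; the boosts, document frequencies, and term frequencies are copied from the explanation.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.019823037
QUERY_TERMS = 25  # denominator of coord(n/25) in the explanations

def idf(doc_freq: int) -> float:
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1))

def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = sqrt(freq) * idf * fieldNorm
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

# (boost, docFreq, termFreq) for the terms matched in doc 601, fieldNorm = 0.0625.
matches_601 = [
    (1.0297269, 155, 1.0),   # manually
    (1.0317258, 153, 1.0),   # descriptors
    (1.4553212, 9, 7.0),     # keyphrases
    (1.5838838, 723, 1.0),   # training
    (1.6097552, 665, 2.0),   # automatic
    (2.2008467, 1054, 1.0),  # domain
    (2.5538113, 1949, 1.0),  # documents
    (2.8293538, 12, 2.0),    # keyphrase
]

raw_sum = sum(term_score(b, df, f, 0.0625) for b, df, f in matches_601)
coord = len(matches_601) / QUERY_TERMS  # 8/25 = 0.32
score_601 = raw_sum * coord

print(round(score_601, 2))  # 0.38
```

The coord factor explains why the fifth result ranks last despite a raw sum (1.1270564) close to that of the first: only 4 of the 25 query terms match, so its sum is scaled by 0.16 rather than 0.32.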