Search (8 results, page 1 of 1)

Suominen, O.; Koskenniemi, I.: Annif Analyzer Shootout : comparing text lemmatization methods for automated subject indexing (2022) 0.08
```
0.08299318 = product of:
  0.12448977 = sum of:
    0.08966068 = weight(_text_:systematic in 658) [ClassicSimilarity], result of:
      0.08966068 = score(doc=658,freq=2.0), product of:
        0.28397155 = queryWeight, product of:
          5.715473 = idf(docFreq=395, maxDocs=44218)
          0.049684696 = queryNorm
        0.31573826 = fieldWeight in 658, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.715473 = idf(docFreq=395, maxDocs=44218)
          0.0390625 = fieldNorm(doc=658)
    0.03482909 = product of:
      0.06965818 = sum of:
        0.06965818 = weight(_text_:indexing in 658) [ClassicSimilarity], result of:
          0.06965818 = score(doc=658,freq=6.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.3662626 = fieldWeight in 658, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0390625 = fieldNorm(doc=658)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments have been performed using the open source Annif toolkit for automated subject indexing and classification, but should generalize also to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform baseline methods in text classification particularly for Finnish and Swedish text, but not English, where baseline methods are most effective. The differences between lemmatization methods are quite small. The systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.

Gödert, W.: Detecting multiword phrases in mathematical text corpora (2012) 0.02

0.015166845 = product of:
  0.045500536 = sum of:
    0.045500536 = product of:
      0.09100107 = sum of:
        0.09100107 = weight(_text_:indexing in 466) [ClassicSimilarity], result of:
          0.09100107 = score(doc=466,freq=4.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.47848347 = fieldWeight in 466, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0625 = fieldNorm(doc=466)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Abstract: We present an approach for detecting multiword phrases in mathematical text corpora. The method used is based on characteristic features of mathematical terminology. It makes use of a software tool named Lingo which allows to identify words by means of previously defined dictionaries for specific word classes as adjectives, personal names or nouns. The detection of multiword groups is done algorithmically. Possible advantages of the method for indexing and information retrieval and conclusions for applying dictionary-based methods of automatic indexing instead of stemming procedures are discussed.

Junger, U.: Can indexing be automated? : the example of the Deutsche Nationalbibliothek (2012) 0.01

0.013270989 = product of:
  0.039812967 = sum of:
    0.039812967 = product of:
      0.079625934 = sum of:
        0.079625934 = weight(_text_:indexing in 1717) [ClassicSimilarity], result of:
          0.079625934 = score(doc=1717,freq=4.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.41867304 = fieldWeight in 1717, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1717)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Abstract: The German subject headings authority file (Schlagwortnormdatei/SWD) provides a broad controlled vocabulary for indexing documents of all subjects. Traditionally used for intellectual subject cataloguing primarily of books the Deutsche Nationalbibliothek (DNB, German National Library) has been working on developping and implementing procedures for automated assignment of subject headings for online publications. This project, its results and problems are sketched in the paper.

Mongin, L.; Fu, Y.Y.; Mostafa, J.: Open Archives data Service prototype and automated subject indexing using D-Lib archive content as a testbed (2003) 0.01

0.0080434345 = product of:
  0.024130303 = sum of:
    0.024130303 = product of:
      0.048260607 = sum of:
        0.048260607 = weight(_text_:indexing in 1167) [ClassicSimilarity], result of:
          0.048260607 = score(doc=1167,freq=2.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.2537542 = fieldWeight in 1167, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.046875 = fieldNorm(doc=1167)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Husevag, A.-S.R.: Named entities in indexing : a case study of TV subtitles and metadata records (2016) 0.01

0.0067028617 = product of:
  0.020108584 = sum of:
    0.020108584 = product of:
      0.04021717 = sum of:
        0.04021717 = weight(_text_:indexing in 3105) [ClassicSimilarity], result of:
          0.04021717 = score(doc=3105,freq=2.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.21146181 = fieldWeight in 3105, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3105)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Toepfer, M.; Seifert, C.: Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints 0.01

0.0067028617 = product of:
  0.020108584 = sum of:
    0.020108584 = product of:
      0.04021717 = sum of:
        0.04021717 = weight(_text_:indexing in 4309) [ClassicSimilarity], result of:
          0.04021717 = score(doc=4309,freq=2.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.21146181 = fieldWeight in 4309, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4309)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Junger, U.; Schwens, U.: ¬Die inhaltliche Erschließung des schriftlichen kulturellen Erbes auf dem Weg in die Zukunft : Automatische Vergabe von Schlagwörtern in der Deutschen Nationalbibliothek (2017) 0.01

0.005609659 = product of:
  0.016828977 = sum of:
    0.016828977 = product of:
      0.033657953 = sum of:
        0.033657953 = weight(_text_:22 in 3780) [ClassicSimilarity], result of:
          0.033657953 = score(doc=3780,freq=2.0), product of:
            0.17398734 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049684696 = queryNorm
            0.19345059 = fieldWeight in 3780, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3780)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 19. 8.2017 9:24:22

Search Engines and Beyond : Developing efficient knowledge management systems, April 19-20 1999, Boston, Mass (1999) 0.01
```
0.00536229 = product of:
  0.016086869 = sum of:
    0.016086869 = product of:
      0.032173738 = sum of:
        0.032173738 = weight(_text_:indexing in 2596) [ClassicSimilarity], result of:
          0.032173738 = score(doc=2596,freq=2.0), product of:
            0.19018644 = queryWeight, product of:
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.049684696 = queryNorm
            0.16916946 = fieldWeight in 2596, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.8278677 = idf(docFreq=2614, maxDocs=44218)
              0.03125 = fieldNorm(doc=2596)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)
```
Content

Ramana Rao (Inxight, Palo Alto, CA) 7 ± 2 Insights on achieving Effective Information Access Session One: Updates and a twelve month perspective Danny Sullivan (Search Engine Watch, US / England) Portalization and other search trends Carol Tenopir (University of Tennessee) Search realities faced by end users and professional searchers Session Two: Today's search engines and beyond Daniel Hoogterp (Retrieval Technologies, McLean, VA) Effective presentation and utilization of search techniques Rick Kenny (Fulcrum Technologies, Ontario, Canada) Beyond document clustering: The knowledge impact statement Gary Stock (Ingenius, Kalamazoo, MI) Automated change monitoring Gary Culliss (Direct Hit, Wellesley Hills, MA) User popularity ranked search engines Byron Dom (IBM, CA) Automatically finding the best pages on the World Wide Web (CLEVER) Peter Tomassi (LookSmart, San Francisco, CA) Adding human intellect to search technology Session Three: Panel discussion: Human v automated categorization and editing Ev Brenner (New York, NY)- Chairman James Callan (University of Massachusetts, MA) Marc Krellenstein (Northern Light Technology, Cambridge, MA) Dan Miller (Ask Jeeves, Berkeley, CA) Session Four: Updates and a twelve month perspective Steve Arnold (AIT, Harrods Creek, KY) Review: The leading edge in search and retrieval software Ellen Voorhees (NIST, Gaithersburg, MD) TREC update Session Five: Search engines now and beyond Intelligent Agents John Snyder (Muscat, Cambridge, England) Practical issues behind intelligent agents Text summarization Therese Firmin, (Dept of Defense, Ft George G. Meade, MD) The TIPSTER/SUMMAC evaluation of automatic text summarization systems Cross language searching Elizabeth Liddy (TextWise, Syracuse, NY) A conceptual interlingua approach to cross-language retrieval. Video search and retrieval Armon Amir (IBM, Almaden, CA) CueVideo: Modular system for automatic indexing and browsing of video/audio Speech recognition Michael Witbrock (Lycos, Waltham, MA) Retrieval of spoken documents Visualization James A. Wise (Integral Visuals, Richland, WA) Information visualization in the new millennium: Emerging science or passing fashion? Text mining David Evans (Claritech, Pittsburgh, PA) Text mining - towards decision support

Search (8 results, page 1 of 1)

Authors

Years

Languages

Themes