Search (108 results, page 1 of 6)

  • Filter: theme_ss:"Automatisches Indexieren"
  1. Galvez, C.; Moya-Anegón, F. de: ¬An evaluation of conflation accuracy using finite-state transducers (2006) 0.04
    0.044796567 = product of:
      0.1343897 = sum of:
        0.11347422 = weight(_text_:graphic in 5599) [ClassicSimilarity], result of:
          0.11347422 = score(doc=5599,freq=2.0), product of:
            0.25850594 = queryWeight, product of:
              6.6217136 = idf(docFreq=159, maxDocs=44218)
              0.03903913 = queryNorm
            0.43896174 = fieldWeight in 5599, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.6217136 = idf(docFreq=159, maxDocs=44218)
              0.046875 = fieldNorm(doc=5599)
        0.020915478 = product of:
          0.041830957 = sum of:
            0.041830957 = weight(_text_:methods in 5599) [ClassicSimilarity], result of:
              0.041830957 = score(doc=5599,freq=2.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.26651827 = fieldWeight in 5599, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5599)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
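All of the score breakdowns on this page follow Lucene's ClassicSimilarity (TF-IDF) formula. As a cross-check, the first result's score can be reproduced in plain Python from the factors printed above; every constant below is copied from the breakdown, nothing else is assumed.

```python
import math

# Factors copied from the ClassicSimilarity explanation for doc 5599.
query_norm = 0.03903913

def term_score(tf, idf, field_norm):
    """score = queryWeight * fieldWeight, where
    queryWeight = idf * queryNorm
    fieldWeight = sqrt(tf) * idf * fieldNorm
    """
    query_weight = idf * query_norm
    field_weight = math.sqrt(tf) * idf * field_norm
    return query_weight * field_weight

# "graphic": freq=2.0, idf=6.6217136, fieldNorm=0.046875
graphic = term_score(2.0, 6.6217136, 0.046875)        # ~0.1134742
# "methods": freq=2.0, idf=4.0204134, fieldNorm=0.046875, times coord(1/2) of the nested clause
methods = term_score(2.0, 4.0204134, 0.046875) * 0.5  # ~0.0209155

# Outer coord(2/6): two of six query clauses matched in this document.
total = (graphic + methods) * (2.0 / 6.0)
print(total)  # ~0.0447966; Lucene prints 0.044796567 (single-precision rounding)
```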
    
    Abstract
    Purpose - To evaluate the accuracy of conflation methods based on finite-state transducers (FSTs). Design/methodology/approach - Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The process of normalization we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm. Findings - The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms. Originality/value - The report outlines the potential of transducers in their application to normalization processes.
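The comparison described above, dictionary-based lemmatization versus a Spanish Porter-type stemmer scored by how well variant forms conflate to a common canonical form, might be sketched roughly as follows. The word groups and the lemma dictionary are invented stand-ins for the authors' FST lexicons, and NLTK's Spanish Snowball stemmer stands in for the Spanish Porter variant.

```python
from itertools import combinations
from nltk.stem.snowball import SnowballStemmer

# Toy variant groups that should conflate together (stand-in for a gold standard).
groups = [
    ["gato", "gatos"],
    ["cantar", "cantaba", "cantando"],
    ["informe", "informes"],
]
# Toy lemma dictionary standing in for the FST lexicon (deliberately incomplete).
lemma_dict = {"gato": "gato", "gatos": "gato",
              "cantar": "cantar", "cantaba": "cantar",
              "informe": "informe", "informes": "informe"}

stemmer = SnowballStemmer("spanish")

def pair_conflation_rate(normalize):
    """Share of within-group word pairs mapped to the same normalized form."""
    pairs = [(a, b) for g in groups for a, b in combinations(g, 2)]
    merged = sum(1 for a, b in pairs
                 if normalize(a) is not None and normalize(a) == normalize(b))
    return merged / len(pairs)

print("stemmer:   ", pair_conflation_rate(stemmer.stem))
print("lemmatizer:", pair_conflation_rate(lemma_dict.get))  # misses "cantando" (under-analysis)
```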
  2. Kutschekmanesch, S.; Lutes, B.; Moelle, K.; Thiel, U.; Tzeras, K.: Automated multilingual indexing : a synthesis of rule-based and thesaurus-based methods (1998) 0.02
    0.02043515 = product of:
      0.1226109 = sum of:
        0.1226109 = sum of:
          0.069718264 = weight(_text_:methods in 4157) [ClassicSimilarity], result of:
            0.069718264 = score(doc=4157,freq=2.0), product of:
              0.15695344 = queryWeight, product of:
                4.0204134 = idf(docFreq=2156, maxDocs=44218)
                0.03903913 = queryNorm
              0.4441971 = fieldWeight in 4157, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0204134 = idf(docFreq=2156, maxDocs=44218)
                0.078125 = fieldNorm(doc=4157)
          0.052892637 = weight(_text_:22 in 4157) [ClassicSimilarity], result of:
            0.052892637 = score(doc=4157,freq=2.0), product of:
              0.1367084 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.03903913 = queryNorm
              0.38690117 = fieldWeight in 4157, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.078125 = fieldNorm(doc=4157)
      0.16666667 = coord(1/6)
    
    Source
    Information und Märkte: 50. Deutscher Dokumentartag 1998, Kongreß der Deutschen Gesellschaft für Dokumentation e.V. (DGD), Rheinische Friedrich-Wilhelms-Universität Bonn, 22.-24. September 1998. Hrsg. von Marlies Ockenfeld u. Gerhard J. Mantwill
  3. Schulz, K.U.; Brunner, L.: Vollautomatische thematische Verschlagwortung großer Textkollektionen mittels semantischer Netze (2017) 0.01
    0.014928497 = product of:
      0.089570984 = sum of:
        0.089570984 = sum of:
          0.052210055 = weight(_text_:theory in 3493) [ClassicSimilarity], result of:
            0.052210055 = score(doc=3493,freq=2.0), product of:
              0.16234003 = queryWeight, product of:
                4.1583924 = idf(docFreq=1878, maxDocs=44218)
                0.03903913 = queryNorm
              0.32160926 = fieldWeight in 3493, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1583924 = idf(docFreq=1878, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3493)
          0.03736093 = weight(_text_:29 in 3493) [ClassicSimilarity], result of:
            0.03736093 = score(doc=3493,freq=2.0), product of:
              0.13732746 = queryWeight, product of:
                3.5176873 = idf(docFreq=3565, maxDocs=44218)
                0.03903913 = queryNorm
              0.27205724 = fieldWeight in 3493, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5176873 = idf(docFreq=3565, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3493)
      0.16666667 = coord(1/6)
    
    Source
    Theorie, Semantik und Organisation von Wissen: Proceedings der 13. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und dem 13. Internationalen Symposium der Informationswissenschaft der Higher Education Association for Information Science (HI) Potsdam (19.-20.03.2013): 'Theory, Information and Organization of Knowledge' / Proceedings der 14. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und Natural Language & Information Systems (NLDB) Passau (16.06.2015): 'Lexical Resources for Knowledge Organization' / Proceedings des Workshops der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) auf der SEMANTICS Leipzig (1.09.2014): 'Knowledge Organization and Semantic Web' / Proceedings des Workshops der Polnischen und Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) Cottbus (29.-30.09.2011): 'Economics of Knowledge Production and Organization'. Hrsg. von W. Babik, H.P. Ohly u. K. Weber
  4. Böhm, A.; Seifert, C.; Schlötterer, J.; Granitzer, M.: Identifying tweets from the economic domain (2017) 0.01
    0.014928497 = product of:
      0.089570984 = sum of:
        0.089570984 = sum of:
          0.052210055 = weight(_text_:theory in 3495) [ClassicSimilarity], result of:
            0.052210055 = score(doc=3495,freq=2.0), product of:
              0.16234003 = queryWeight, product of:
                4.1583924 = idf(docFreq=1878, maxDocs=44218)
                0.03903913 = queryNorm
              0.32160926 = fieldWeight in 3495, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1583924 = idf(docFreq=1878, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3495)
          0.03736093 = weight(_text_:29 in 3495) [ClassicSimilarity], result of:
            0.03736093 = score(doc=3495,freq=2.0), product of:
              0.13732746 = queryWeight, product of:
                3.5176873 = idf(docFreq=3565, maxDocs=44218)
                0.03903913 = queryNorm
              0.27205724 = fieldWeight in 3495, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5176873 = idf(docFreq=3565, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3495)
      0.16666667 = coord(1/6)
    
    Source
    Theorie, Semantik und Organisation von Wissen: Proceedings der 13. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und dem 13. Internationalen Symposium der Informationswissenschaft der Higher Education Association for Information Science (HI) Potsdam (19.-20.03.2013): 'Theory, Information and Organization of Knowledge' / Proceedings der 14. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und Natural Language & Information Systems (NLDB) Passau (16.06.2015): 'Lexical Resources for Knowledge Organization' / Proceedings des Workshops der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) auf der SEMANTICS Leipzig (1.09.2014): 'Knowledge Organization and Semantic Web' / Proceedings des Workshops der Polnischen und Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) Cottbus (29.-30.09.2011): 'Economics of Knowledge Production and Organization'. Hrsg. von W. Babik, H.P. Ohly u. K. Weber
  5. Kempf, A.O.: Neue Verfahrenswege der Wissensorganisation : eine Evaluation automatischer Indexierung in der sozialwissenschaftlichen Fachinformation (2017) 0.01
    0.014928497 = product of:
      0.089570984 = sum of:
        0.089570984 = sum of:
          0.052210055 = weight(_text_:theory in 3497) [ClassicSimilarity], result of:
            0.052210055 = score(doc=3497,freq=2.0), product of:
              0.16234003 = queryWeight, product of:
                4.1583924 = idf(docFreq=1878, maxDocs=44218)
                0.03903913 = queryNorm
              0.32160926 = fieldWeight in 3497, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1583924 = idf(docFreq=1878, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3497)
          0.03736093 = weight(_text_:29 in 3497) [ClassicSimilarity], result of:
            0.03736093 = score(doc=3497,freq=2.0), product of:
              0.13732746 = queryWeight, product of:
                3.5176873 = idf(docFreq=3565, maxDocs=44218)
                0.03903913 = queryNorm
              0.27205724 = fieldWeight in 3497, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5176873 = idf(docFreq=3565, maxDocs=44218)
                0.0546875 = fieldNorm(doc=3497)
      0.16666667 = coord(1/6)
    
    Source
    Theorie, Semantik und Organisation von Wissen: Proceedings der 13. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und dem 13. Internationalen Symposium der Informationswissenschaft der Higher Education Association for Information Science (HI) Potsdam (19.-20.03.2013): 'Theory, Information and Organization of Knowledge' / Proceedings der 14. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und Natural Language & Information Systems (NLDB) Passau (16.06.2015): 'Lexical Resources for Knowledge Organization' / Proceedings des Workshops der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) auf der SEMANTICS Leipzig (1.09.2014): 'Knowledge Organization and Semantic Web' / Proceedings des Workshops der Polnischen und Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) Cottbus (29.-30.09.2011): 'Economics of Knowledge Production and Organization'. Hrsg. von W. Babik, H.P. Ohly u. K. Weber
  6. Matthews, P.; Glitre, K.: Genre analysis of movies using a topic model of plot summaries (2021) 0.01
    0.0144304065 = product of:
      0.04329122 = sum of:
        0.022375738 = product of:
          0.044751476 = sum of:
            0.044751476 = weight(_text_:theory in 412) [ClassicSimilarity], result of:
              0.044751476 = score(doc=412,freq=2.0), product of:
                0.16234003 = queryWeight, product of:
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.03903913 = queryNorm
                0.27566507 = fieldWeight in 412, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.046875 = fieldNorm(doc=412)
          0.5 = coord(1/2)
        0.020915478 = product of:
          0.041830957 = sum of:
            0.041830957 = weight(_text_:methods in 412) [ClassicSimilarity], result of:
              0.041830957 = score(doc=412,freq=2.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.26651827 = fieldWeight in 412, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.046875 = fieldNorm(doc=412)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Abstract
    Genre plays an important role in the description, navigation, and discovery of movies, but it is rarely studied at large scale using quantitative methods. Such large-scale, quantitative study allows an analysis of how genre labels are applied, how genres are composed and how these ingredients change, and how genres compare. We apply unsupervised topic modeling to a large collection of textual movie summaries and then use the model's topic proportions to investigate key questions in genre, including recognizability, mapping, canonicity, and change over time. We find that many genres can be quite easily predicted by their lexical signatures, and this defines their position on the genre landscape. We find significant genre composition changes between periods for westerns, science fiction, and road movies, reflecting changes in production and consumption values. We show that, in terms of canonicity, canonical examples are often at the high end of the topic distribution profile for the genre rather than central, as might be predicted by categorization theory.
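A minimal sketch of the general workflow (topic model over plot summaries, topic proportions as a genre signature), using scikit-learn's LDA on a few invented summaries rather than the authors' corpus or parameters:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented mini-corpus of plot summaries (stand-ins for the real data).
summaries = [
    "A sheriff rides into a frontier town to face a gang of outlaws.",
    "A lone gunslinger seeks revenge across the desert plains.",
    "A crew aboard a starship discovers an alien signal near a distant planet.",
    "Robots and scientists battle over an artificial intelligence experiment.",
    "Two friends take a road trip across the country and find themselves.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(summaries)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)            # rows: documents, columns: topic proportions

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
print(doc_topics.round(2))                   # topic profile per summary ("genre signature")
```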
  7. Newman, D.J.; Block, S.: Probabilistic topic decomposition of an eighteenth-century American newspaper (2006) 0.01
    0.014304606 = product of:
      0.085827634 = sum of:
        0.085827634 = sum of:
          0.048802786 = weight(_text_:methods in 5291) [ClassicSimilarity], result of:
            0.048802786 = score(doc=5291,freq=2.0), product of:
              0.15695344 = queryWeight, product of:
                4.0204134 = idf(docFreq=2156, maxDocs=44218)
                0.03903913 = queryNorm
              0.31093797 = fieldWeight in 5291, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0204134 = idf(docFreq=2156, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5291)
          0.037024844 = weight(_text_:22 in 5291) [ClassicSimilarity], result of:
            0.037024844 = score(doc=5291,freq=2.0), product of:
              0.1367084 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.03903913 = queryNorm
              0.2708308 = fieldWeight in 5291, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5291)
      0.16666667 = coord(1/6)
    
    Abstract
    We use a probabilistic mixture decomposition method to determine topics in the Pennsylvania Gazette, a major colonial U.S. newspaper from 1728-1800. We assess the value of several topic decomposition techniques for historical research and compare the accuracy and efficacy of various methods. After determining the topics covered by the 80,000 articles and advertisements in the entire 18th century run of the Gazette, we calculate how the prevalence of those topics changed over time, and give historically relevant examples of our findings. This approach reveals important information about the content of this colonial newspaper, and suggests the value of such approaches to a more complete understanding of early American print culture and society.
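The "prevalence over time" step reduces to averaging per-document topic proportions by publication year; a small sketch with invented years and topic vectors:

```python
from collections import defaultdict

# Invented (year, topic-proportion) pairs standing in for the Gazette articles.
docs = [
    (1730, [0.70, 0.20, 0.10]),
    (1730, [0.60, 0.30, 0.10]),
    (1770, [0.20, 0.30, 0.50]),
    (1770, [0.10, 0.40, 0.50]),
]

by_year = defaultdict(list)
for year, topics in docs:
    by_year[year].append(topics)

# Mean topic proportion per year = prevalence of each topic in that year.
for year in sorted(by_year):
    rows = by_year[year]
    means = [sum(col) / len(rows) for col in zip(*rows)]
    print(year, [round(m, 2) for m in means])
```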
    Date
    22. 7.2006 17:32:00
  8. Wolfekuhler, M.R.; Punch, W.F.: Finding salient features for personal Web pages categories (1997) 0.01
    0.01239763 = product of:
      0.03719289 = sum of:
        0.018680464 = product of:
          0.03736093 = sum of:
            0.03736093 = weight(_text_:29 in 2673) [ClassicSimilarity], result of:
              0.03736093 = score(doc=2673,freq=2.0), product of:
                0.13732746 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.03903913 = queryNorm
                0.27205724 = fieldWeight in 2673, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2673)
          0.5 = coord(1/2)
        0.018512422 = product of:
          0.037024844 = sum of:
            0.037024844 = weight(_text_:22 in 2673) [ClassicSimilarity], result of:
              0.037024844 = score(doc=2673,freq=2.0), product of:
                0.1367084 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03903913 = queryNorm
                0.2708308 = fieldWeight in 2673, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2673)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Date
    1. 8.1996 22:08:06
    Source
    Computer networks and ISDN systems. 29(1997) no.8, S.1147-1156
  9. Franke-Maier, M.: Anforderungen an die Qualität der Inhaltserschließung im Spannungsfeld von intellektuell und automatisch erzeugten Metadaten (2018) 0.01
    0.01239763 = product of:
      0.03719289 = sum of:
        0.018680464 = product of:
          0.03736093 = sum of:
            0.03736093 = weight(_text_:29 in 5344) [ClassicSimilarity], result of:
              0.03736093 = score(doc=5344,freq=2.0), product of:
                0.13732746 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.03903913 = queryNorm
                0.27205724 = fieldWeight in 5344, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5344)
          0.5 = coord(1/2)
        0.018512422 = product of:
          0.037024844 = sum of:
            0.037024844 = weight(_text_:22 in 5344) [ClassicSimilarity], result of:
              0.037024844 = score(doc=5344,freq=2.0), product of:
                0.1367084 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03903913 = queryNorm
                0.2708308 = fieldWeight in 5344, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5344)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Abstract
    At the latest since the Deutscher Bibliothekartag 2018, the discussion of the Deutsche Nationalbibliothek's automatic subject indexing methods has turned from a politically driven debate into a debate about quality. The following contribution deals with questions of the quality of subject indexing in the digital age, where heterogeneous products of different methods meet, and attempts to define key quality requirements. This conference contribution summarizes the ideas the author presented as impulses at the workshop of the FAG "Erschließung und Informationsvermittlung" of the GBV on 29 August 2018 in Kiel. The workshop took place as part of the 22nd Verbundkonferenz of the GBV.
  10. Salton, G.: Automatic processing of foreign language documents (1985) 0.01
    0.009620271 = product of:
      0.028860811 = sum of:
        0.014917159 = product of:
          0.029834319 = sum of:
            0.029834319 = weight(_text_:theory in 3650) [ClassicSimilarity], result of:
              0.029834319 = score(doc=3650,freq=2.0), product of:
                0.16234003 = queryWeight, product of:
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.03903913 = queryNorm
                0.18377672 = fieldWeight in 3650, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3650)
          0.5 = coord(1/2)
        0.013943653 = product of:
          0.027887305 = sum of:
            0.027887305 = weight(_text_:methods in 3650) [ClassicSimilarity], result of:
              0.027887305 = score(doc=3650,freq=2.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.17767884 = fieldWeight in 3650, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3650)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Abstract
    The attempt to computerize a process, such as indexing, abstracting, classifying, or retrieving information, begins with an analysis of the process into its intellectual and nonintellectual components. That part of the process which is amenable to computerization is mechanical or algorithmic. What is not is intellectual or creative and requires human intervention. Gerard Salton has been an innovator, experimenter, and promoter in the area of mechanized information systems since the early 1960s. He has been particularly ingenious at analyzing the process of information retrieval into its algorithmic components. He received a doctorate in applied mathematics from Harvard University before moving to the computer science department at Cornell, where he developed a prototype automatic retrieval system called SMART. Working with this system he and his students contributed for over a decade to our theoretical understanding of the retrieval process. On a more practical level, they have contributed design criteria for operating retrieval systems. The following selection presents one of the early descriptions of the SMART system; it is valuable as it shows the direction automatic retrieval methods were to take beyond simple word-matching techniques. These include various word normalization techniques to improve recall, for instance, the separation of words into stems and affixes; the correlation and clustering, using statistical association measures, of related terms; and the identification, using a concept thesaurus, of synonymous, broader, narrower, and sibling terms. They include, as well, techniques, both linguistic and statistical, to deal with the thorny problem of how to automatically extract from texts index terms that consist of more than one word. They include weighting techniques and various document-request matching algorithms. Significant among the latter are those which produce a retrieval output of citations ranked in relevance order. During the 1970s, Salton and his students went on to further refine these various techniques, particularly the weighting and statistical association measures. Many of their early innovations seem commonplace today. Some of their later techniques are still ahead of their time and await technological developments for implementation. The particular focus of the selection that follows is on the evaluation of a particular component of the SMART system, a multilingual thesaurus. By mapping English language expressions and their German equivalents to a common concept number, the thesaurus permitted the automatic processing of German language documents against English language queries and vice versa. The results of the evaluation, as it turned out, were somewhat inconclusive. However, this SMART experiment suggested in a bold and optimistic way how one might proceed to answer such complex questions as: What is meant by retrieval language compatibility? How is it to be achieved, and how evaluated?
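The multilingual thesaurus idea, sketched here in miniature: English and German surface forms map to shared concept numbers, and query-document matching is done in concept space. The thesaurus entries and texts below are invented; this is not the SMART data or code.

```python
from collections import Counter
from math import sqrt

# Toy bilingual thesaurus: surface form -> concept number (invented entries).
concepts = {
    "retrieval": 1, "recherche": 1,
    "document": 2, "dokument": 2,
    "index": 3, "register": 3,
}

def concept_vector(text):
    """Count concept numbers rather than words, so languages become comparable."""
    tokens = text.lower().split()
    return Counter(concepts[t] for t in tokens if t in concepts)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query_en = "document retrieval index"
doc_de = "dokument recherche register"
print(cosine(concept_vector(query_en), concept_vector(doc_de)))  # 1.0: same concepts
```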
    Source
    Theory of subject analysis: a sourcebook. Ed.: L.M. Chan, et al
  11. Griffiths, A.; Robinson, L.A.; Willett, P.: Hierarchic agglomerative clustering methods for automatic document classification (1984) 0.01
    0.009295769 = product of:
      0.05577461 = sum of:
        0.05577461 = product of:
          0.11154922 = sum of:
            0.11154922 = weight(_text_:methods in 2414) [ClassicSimilarity], result of:
              0.11154922 = score(doc=2414,freq=2.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.71071535 = fieldWeight in 2414, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.125 = fieldNorm(doc=2414)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
  12. Kuhlen, R.: Morphologische Relationen durch Reduktionsalgorithmen (1974) 0.01
    0.008806056 = product of:
      0.052836336 = sum of:
        0.052836336 = product of:
          0.10567267 = sum of:
            0.10567267 = weight(_text_:29 in 4251) [ClassicSimilarity], result of:
              0.10567267 = score(doc=4251,freq=4.0), product of:
                0.13732746 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.03903913 = queryNorm
                0.7694941 = fieldWeight in 4251, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.109375 = fieldNorm(doc=4251)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Date
    29. 1.2011 14:56:29
  13. SIGIR'92 : Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1992) 0.01
    0.008417737 = product of:
      0.02525321 = sum of:
        0.013052514 = product of:
          0.026105028 = sum of:
            0.026105028 = weight(_text_:theory in 6671) [ClassicSimilarity], result of:
              0.026105028 = score(doc=6671,freq=2.0), product of:
                0.16234003 = queryWeight, product of:
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.03903913 = queryNorm
                0.16080463 = fieldWeight in 6671, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=6671)
          0.5 = coord(1/2)
        0.012200696 = product of:
          0.024401393 = sum of:
            0.024401393 = weight(_text_:methods in 6671) [ClassicSimilarity], result of:
              0.024401393 = score(doc=6671,freq=2.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.15546899 = fieldWeight in 6671, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=6671)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Content
    HARMAN, D.: Relevance feedback revisited; AALBERSBERG, I.J.: Incremental relevance feedback; TAGUE-SUTCLIFFE, J.: Measuring the informativeness of a retrieval process; LEWIS, D.D.: An evaluation of phrasal and clustered representations on a text categorization task; BLOSSEVILLE, M.J., G. HÉBRAIL, M.G. MONTEIL u. N. PÉNOT: Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together; MASAND, B., G. LINOFF u. D. WALTZ: Classifying news stories using memory based reasoning; KEEN, E.M.: Term position ranking: some new test results; CROUCH, C.J. u. B. YANG: Experiments in automatic statistical thesaurus construction; GREFENSTETTE, G.: Use of syntactic context to produce term association lists for text retrieval; ANICK, P.G. u. R.A. FLYNN: Versioning a full-text information retrieval system; BURKOWSKI, F.J.: Retrieval activities in a database consisting of heterogeneous collections; DEERWESTER, S.C., K. WACLENA u. M. LaMAR: A textual object management system; NIE, J.-Y.: Towards a probabilistic modal logic for semantic-based information retrieval; WANG, A.W., S.K.M. WONG u. Y.Y. YAO: An analysis of vector space models based on computational geometry; BARTELL, B.T., G.W. COTTRELL u. R.K. BELEW: Latent semantic indexing is an optimal special case of multidimensional scaling; GLAVITSCH, U. u. P. SCHÄUBLE: A system for retrieving speech documents; MARGULIS, E.L.: N-Poisson document modelling; HESS, M.: An incrementally extensible document retrieval system based on linguistic and logical principles; COOPER, W.S., F.C. GEY u. D.P. DABNEY: Probabilistic retrieval based on staged logistic regression; FUHR, N.: Integration of probabilistic fact and text retrieval; CROFT, B., L.A. SMITH u. H. TURTLE: A loosely-coupled integration of a text retrieval system and an object-oriented database system; DUMAIS, S.T. u. J. NIELSEN: Automating the assignment of submitted manuscripts to reviewers; GOST, M.A. u. M. MASOTTI: Design of an OPAC database to permit different subject searching accesses; ROBERTSON, A.M. u. P. WILLETT: Searching for historical word forms in a database of 17th century English text using spelling correction methods; FOX, E.A., Q.F. CHEN u. L.S. HEATH: A faster algorithm for constructing minimal perfect hash functions; MOFFAT, A. u. J. ZOBEL: Parameterised compression for sparse bitmaps; GRANDI, F., P. TIBERIO u. P. ZEZULA: Frame-sliced partitioned parallel signature files; ALLEN, B.: Cognitive differences in end user searching of a CD-ROM index; SONNENWALD, D.H.: Developing a theory to guide the process of designing information retrieval systems; CUTTING, D.R., J.O. PEDERSEN, D. KARGER u. J.W. TUKEY: Scatter/Gather: a cluster-based approach to browsing large document collections; CHALMERS, M. u. P. CHITSON: Bead: Explorations in information visualization; WILLIAMSON, C. u. B. SHNEIDERMAN: The dynamic HomeFinder: evaluating dynamic queries in a real-estate information exploration system
  14. Needham, R.M.; Sparck Jones, K.: Keywords and clumps (1985) 0.01
    0.008417737 = product of:
      0.02525321 = sum of:
        0.013052514 = product of:
          0.026105028 = sum of:
            0.026105028 = weight(_text_:theory in 3645) [ClassicSimilarity], result of:
              0.026105028 = score(doc=3645,freq=2.0), product of:
                0.16234003 = queryWeight, product of:
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.03903913 = queryNorm
                0.16080463 = fieldWeight in 3645, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.1583924 = idf(docFreq=1878, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=3645)
          0.5 = coord(1/2)
        0.012200696 = product of:
          0.024401393 = sum of:
            0.024401393 = weight(_text_:methods in 3645) [ClassicSimilarity], result of:
              0.024401393 = score(doc=3645,freq=2.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.15546899 = fieldWeight in 3645, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=3645)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Abstract
    The selection that follows was chosen as it represents "a very early paper on the possibilities allowed by computers in documentation." In the early 1960s computers were being used to provide simple automatic indexing systems wherein keywords were extracted from documents. The problem with such systems was that they lacked vocabulary control, thus documents related in subject matter were not always collocated in retrieval. To improve retrieval by improving recall is the raison d'être of vocabulary control tools such as classifications and thesauri. The question arose whether it was possible by automatic means to construct classes of terms which, when substituted one for another, could be used to improve retrieval performance. One of the first theoretical approaches to this question was initiated by R. M. Needham and Karen Sparck Jones at the Cambridge Language Research Institute in England. The question was later pursued using experimental methodologies by Sparck Jones, who, as a Senior Research Associate in the Computer Laboratory at the University of Cambridge, has devoted her life's work to research in information retrieval and automatic natural language processing. Based on the principles of numerical taxonomy, automatic classification techniques start from the premise that two objects are similar to the degree that they share attributes in common. When these two objects are keywords, their similarity is measured in terms of the number of documents they index in common. Step 1 in automatic classification is to compute mathematically the degree to which two terms are similar. Step 2 is to group together those terms that are "most similar" to each other, forming equivalence classes of intersubstitutable terms. The technique for forming such classes varies and is the factor that characteristically distinguishes different approaches to automatic classification. The technique used by Needham and Sparck Jones, that of clumping, is described in the selection that follows. Questions that must be asked are whether the use of automatically generated classes really does improve retrieval performance and whether there is a true economic advantage in substituting mechanical for manual labor. Several years after her work with clumping, Sparck Jones was to observe that while it was not wholly satisfactory in itself, it was valuable in that it stimulated research into automatic classification. To this it might be added that it was valuable in that it introduced to library/information science the methods of numerical taxonomy, thus stimulating us to think again about the fundamental nature and purpose of classification. In this connection it might be useful to review how automatically derived classes differ from those of manually constructed classifications: 1) the manner of their derivation is purely a posteriori, the ultimate operationalization of the principle of literary warrant; 2) the relationship between members forming such classes is essentially statistical; the members of a given class are similar to each other not because they possess the class-defining characteristic but by virtue of sharing a family resemblance; and finally, 3) automatically derived classes are not related meaningfully one to another, that is, they are not ordered in traditional hierarchical and precedence relationships.
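The core of the clumping approach, keyword similarity measured by the documents two terms index in common and "most similar" terms grouped into intersubstitutable classes, can be illustrated with a toy inverted index and a simple greedy grouping (not the original clump-finding procedure):

```python
# Toy inverted index: term -> set of document ids (invented).
index = {
    "boat":   {1, 2, 3},
    "ship":   {2, 3, 4},
    "vessel": {1, 2, 4},
    "tax":    {5, 6},
    "duty":   {5, 6, 7},
}

def similarity(a, b):
    """Overlap of the documents two terms index in common (Jaccard coefficient)."""
    return len(index[a] & index[b]) / len(index[a] | index[b])

# Greedy single-pass grouping: a term joins the first clump containing a
# sufficiently similar member, otherwise it starts a new clump.
threshold = 0.4
clumps = []
for term in index:
    for clump in clumps:
        if any(similarity(term, other) >= threshold for other in clump):
            clump.add(term)
            break
    else:
        clumps.append({term})

print(clumps)   # e.g. [{'boat', 'ship', 'vessel'}, {'tax', 'duty'}]
```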
    Source
    Theory of subject analysis: a sourcebook. Ed.: L.M. Chan, et al
  15. Suominen, O.; Koskenniemi, I.: Annif Analyzer Shootout : comparing text lemmatization methods for automated subject indexing (2022) 0.01
    0.0076857167 = product of:
      0.0461143 = sum of:
        0.0461143 = product of:
          0.0922286 = sum of:
            0.0922286 = weight(_text_:methods in 658) [ClassicSimilarity], result of:
              0.0922286 = score(doc=658,freq=14.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.5876176 = fieldWeight in 658, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=658)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Abstract
    Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments have been performed using the open source Annif toolkit for automated subject indexing and classification, but should generalize also to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform baseline methods in text classification particularly for Finnish and Swedish text, but not English, where baseline methods are most effective. The differences between lemmatization methods are quite small. The systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.
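The shootout's basic loop, swap the normalization step, keep the rest of the pipeline fixed, and compare classification quality, might look roughly like this on a toy English corpus. Annif's own analyzers and corpora are not used here; the normalizers shown (NLTK Snowball stemmer, WordNet lemmatizer) are stand-ins.

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

nltk.download("wordnet", quiet=True)   # data needed by the WordNet lemmatizer

# Tiny invented corpus with two "subjects" (stand-in for real training data).
texts = [
    "libraries indexed the catalogued books", "the library catalogues new books",
    "indexing services for library catalogues", "catalogued holdings of the libraries",
    "neural networks learned the training data", "training deep neural networks",
    "the network learns from labelled data", "learned representations in neural networks",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

analyzers = {
    "stemmer":    lambda text: [stemmer.stem(t) for t in text.lower().split()],
    "lemmatizer": lambda text: [lemmatizer.lemmatize(t) for t in text.lower().split()],
    "baseline":   lambda text: text.lower().split(),
}

# Same vectorizer and classifier for every analyzer; only normalization changes.
for name, tokenize in analyzers.items():
    vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
    X = vec.fit_transform(texts)
    score = cross_val_score(LogisticRegression(), X, labels, cv=2).mean()
    print(f"{name:10s} accuracy={score:.2f}  vocabulary={len(vec.vocabulary_)}")
```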
  16. Panyr, J.: STEINADLER: ein Verfahren zur automatischen Deskribierung und zur automatischen thematischen Klassifikation (1978) 0.01
    0.007116368 = product of:
      0.04269821 = sum of:
        0.04269821 = product of:
          0.08539642 = sum of:
            0.08539642 = weight(_text_:29 in 5169) [ClassicSimilarity], result of:
              0.08539642 = score(doc=5169,freq=2.0), product of:
                0.13732746 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.03903913 = queryNorm
                0.6218451 = fieldWeight in 5169, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.125 = fieldNorm(doc=5169)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Source
    Nachrichten für Dokumentation. 29(1978), S.92-96
  17. Salton, G.; Yang, C.S.: On the specification of term values in automatic indexing (1973) 0.01
    0.007116368 = product of:
      0.04269821 = sum of:
        0.04269821 = product of:
          0.08539642 = sum of:
            0.08539642 = weight(_text_:29 in 5476) [ClassicSimilarity], result of:
              0.08539642 = score(doc=5476,freq=2.0), product of:
                0.13732746 = queryWeight, product of:
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.03903913 = queryNorm
                0.6218451 = fieldWeight in 5476, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5176873 = idf(docFreq=3565, maxDocs=44218)
                  0.125 = fieldNorm(doc=5476)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Source
    Journal of documentation. 29(1973), S.351-372
  18. Voorhees, E.M.: Implementing agglomerative hierarchic clustering algorithms for use in document retrieval (1986) 0.01
    0.0070523517 = product of:
      0.04231411 = sum of:
        0.04231411 = product of:
          0.08462822 = sum of:
            0.08462822 = weight(_text_:22 in 402) [ClassicSimilarity], result of:
              0.08462822 = score(doc=402,freq=2.0), product of:
                0.1367084 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03903913 = queryNorm
                0.61904186 = fieldWeight in 402, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.125 = fieldNorm(doc=402)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Source
    Information processing and management. 22(1986) no.6, S.465-476
  19. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.01
    0.0069718263 = product of:
      0.041830957 = sum of:
        0.041830957 = product of:
          0.083661914 = sum of:
            0.083661914 = weight(_text_:methods in 5480) [ClassicSimilarity], result of:
              0.083661914 = score(doc=5480,freq=8.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.53303653 = fieldWeight in 5480, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5480)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Abstract
    (Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods
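A rough scikit-learn rendering of the setup evaluated above, letter 5-gram features, explicit feature selection, and a linear Support Vector Machine; the corpus, the number of selected features, and all other settings are invented for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# Invented snippets standing in for the German training documents.
texts = [
    "Die Bibliothek erschließt ihre Bestände automatisch",
    "Automatische Indexierung der Bibliotheksbestände",
    "Der Fußballverein gewann das Endspiel deutlich",
    "Das Endspiel des Fußballturniers war spannend",
]
labels = ["bibliothek", "bibliothek", "sport", "sport"]

pipeline = Pipeline([
    # Letter 5-grams, echoing the paper's letter 5-gram feature construction.
    ("features", TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5))),
    # Feature selection to curb overfitting and shrink the model.
    ("select", SelectKBest(chi2, k=50)),
    ("svm", LinearSVC()),
])

pipeline.fit(texts, labels)
print(pipeline.predict(["Indexierung und Erschließung in Bibliotheken"]))
```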
  20. Salton, G.: Fast document classification in automatic information retrieval (1978) 0.01
    0.006573101 = product of:
      0.039438605 = sum of:
        0.039438605 = product of:
          0.07887721 = sum of:
            0.07887721 = weight(_text_:methods in 2331) [ClassicSimilarity], result of:
              0.07887721 = score(doc=2331,freq=4.0), product of:
                0.15695344 = queryWeight, product of:
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.03903913 = queryNorm
                0.5025517 = fieldWeight in 2331, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.0204134 = idf(docFreq=2156, maxDocs=44218)
                  0.0625 = fieldNorm(doc=2331)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Abstract
    A classified or clustered file is one where related or similar records are grouped into classes or clusters of items in such a way that all items within a cluster are jointly retrievable. Clustered files are easily adapted to broad and narrow search strategies, and simple file updating methods are available. An inexpensive file clustering method applicable to large files is given, together with appropriate file search methods.
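The clustered-file idea in miniature: group similar records, match a query first against cluster representatives (broad search), then rank only within the best cluster (narrow search). The data, vectorization, and cluster count below are invented, and scikit-learn's agglomerative clustering stands in for the paper's inexpensive clustering method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "automatic indexing of library catalogues",
    "subject indexing in digital libraries",
    "clustering algorithms for document retrieval",
    "hierarchic agglomerative clustering of records",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Build the clustered file: group the records into two clusters.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
centroids = np.vstack([X[labels == c].mean(axis=0) for c in sorted(set(labels))])

query = vec.transform(["indexing library collections"]).toarray()

# Broad search: pick the most promising cluster via its centroid.
best = int(cosine_similarity(query, centroids).argmax())
# Narrow search: rank only the records inside that cluster.
members = np.where(labels == best)[0]
scores = cosine_similarity(query, X[members]).ravel()
for i in members[scores.argsort()[::-1]]:
    print(docs[i])
```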

Types

  • a 99
  • el 8
  • m 3
  • x 3
  • s 2
  • p 1