Search (9 results, page 1 of 1)

  • × theme_ss:"Computerlinguistik"
  • × theme_ss:"Automatisches Indexieren"
  1. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.01
    0.009633917 = product of:
      0.057803504 = sum of:
        0.057803504 = weight(_text_:wide in 5480) [ClassicSimilarity], result of:
          0.057803504 = score(doc=5480,freq=2.0), product of:
            0.19679762 = queryWeight, product of:
              4.4307585 = idf(docFreq=1430, maxDocs=44218)
              0.044416238 = queryNorm
            0.29372054 = fieldWeight in 5480, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.4307585 = idf(docFreq=1430, maxDocs=44218)
              0.046875 = fieldNorm(doc=5480)
      0.16666667 = coord(1/6)
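    The expansions above and throughout this page follow Lucene's ClassicSimilarity (TF-IDF) formula, as the [ClassicSimilarity] tags indicate. Reconstructed, with the per-term weight split into the queryWeight and fieldWeight factors shown in the tree:

    ```latex
    \mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \sum_{t \in q}
      \underbrace{\mathrm{idf}(t)\,\mathrm{queryNorm}(q)}_{\mathrm{queryWeight}} \cdot
      \underbrace{\sqrt{\mathrm{tf}(t,d)}\,\mathrm{idf}(t)\,\mathrm{fieldNorm}(d)}_{\mathrm{fieldWeight}}
    ```

    For the entry above: queryWeight = 4.4307585 × 0.044416238 = 0.19679762, fieldWeight = √2 × 4.4307585 × 0.046875 = 0.29372054, and their product 0.057803504 times coord(1/6) yields the final 0.009633917.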
    
    Abstract
    (Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods.
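    A hedged sketch of the kind of pipeline the abstract evaluates: letter 5-gram features, feature selection, and a linear SVM. The toy corpus, labels, and selection percentile are illustrative assumptions, not the paper's data or settings.

    ```python
    # Letter 5-grams -> chi-squared feature selection -> linear SVM (scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectPercentile, chi2
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["Der Hund bellt im Garten.", "Die Katze schläft auf dem Sofa.",
            "Die Börse schloss heute im Minus.", "Der Aktienkurs steigt weiter."]
    labels = ["tiere", "tiere", "finanzen", "finanzen"]

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5)),  # letter 5-grams
        SelectPercentile(chi2, percentile=50),  # feature selection curbs overfitting (result 1)
        LinearSVC(),  # SVMs were the strongest classifier in the comparison (result 3)
    )
    clf.fit(docs, labels)
    print(clf.predict(["Die Katze jagt im Garten."]))  # expected: ['tiere']
    ```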
  2. Polity, Y.: Vers une ergonomie linguistique (1994) 0.01
    0.008738637 = product of:
      0.05243182 = sum of:
        0.05243182 = weight(_text_:computer in 36) [ClassicSimilarity], result of:
          0.05243182 = score(doc=36,freq=2.0), product of:
            0.16231956 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.044416238 = queryNorm
            0.32301605 = fieldWeight in 36, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.0625 = fieldNorm(doc=36)
      0.16666667 = coord(1/6)
    
    Abstract
    Analyzed a special type of man-machine interaction, that of searching an information system with natural language. A model for full text processing for information retrieval was proposed that considered the system's users and how they employ information. Describes how INIST (the National Institute for Scientific and Technical Information) is developing computer-assisted indexing as an aid to improving relevance when retrieving information from bibliographic data banks.
  3. Wacholder, N.; Byrd, R.J.: Retrieving information from full text using linguistic knowledge (1994) 0.01
    0.008245593 = product of:
      0.049473554 = sum of:
        0.049473554 = product of:
          0.09894711 = sum of:
            0.09894711 = weight(_text_:programs in 8524) [ClassicSimilarity], result of:
              0.09894711 = score(doc=8524,freq=2.0), product of:
                0.25748047 = queryWeight, product of:
                  5.79699 = idf(docFreq=364, maxDocs=44218)
                  0.044416238 = queryNorm
                0.38428974 = fieldWeight in 8524, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.79699 = idf(docFreq=364, maxDocs=44218)
                  0.046875 = fieldNorm(doc=8524)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Abstract
    Examines how techniques in the field of natural language processing can be applied to the analysis of text in information retrieval. State-of-the-art text searching programs cannot distinguish, for example, between occurrences of the sickness AIDS and aids as tools, or between library school and school library, nor equate such terms as online and on-line, which are variants of the same form. To make these distinctions, systems must incorporate knowledge about the meaning of words in context. Research in natural language processing has concentrated on the automatic 'understanding' of language: how to analyze the grammatical structure and meaning of text. Although many aspects of this research remain experimental, describes how these techniques can be used to recognize spelling variants, names, acronyms, and abbreviations.
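    The distinctions described here come down to case- and variant-aware term normalization; a minimal sketch, assuming a hand-made variant table rather than the authors' linguistic analysis:

    ```python
    # Conflate spelling variants (on-line/online) while preserving case
    # distinctions that carry meaning (AIDS vs. aids). Variant table and
    # tokenizer are illustrative assumptions.
    import re

    VARIANTS = {"on-line": "online", "data-base": "database"}  # assumed examples

    def tokenize(text: str) -> list[str]:
        return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)

    def normalize(token: str) -> str:
        if token.isupper():          # keep "AIDS" from collapsing into "aids"
            return token
        t = token.lower()
        return VARIANTS.get(t, t)    # map "on-line" -> "online"

    print([normalize(t) for t in tokenize("AIDS research moved on-line")])
    # ['AIDS', 'research', 'moved', 'online']
    ```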
  4. Renouf, A.: Sticking to the text : a corpus linguist's view of language (1993) 0.01
    0.007646307 = product of:
      0.04587784 = sum of:
        0.04587784 = weight(_text_:computer in 2314) [ClassicSimilarity], result of:
          0.04587784 = score(doc=2314,freq=2.0), product of:
            0.16231956 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.044416238 = queryNorm
            0.28263903 = fieldWeight in 2314, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2314)
      0.16666667 = coord(1/6)
    
    Abstract
    Corpus linguistics is the study of large, computer-held bodies of text. Some corpus linguists are concerned with language description for its own sake. On the corpus-linguistic continuum, the study of raw ASCII text is situated at one end and the study of heavily pre-coded text at the other. Discusses the use of word frequency to identify changes in the lexicon; word repetition and word positioning in automatic abstracting; and word clusters in automatic text retrieval. Compares machine extracts with manual abstracts. Abstractors and indexers may find themselves taking the original wording of the text more into account as the focus moves towards the electronic medium and away from hard copy.
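    The word-frequency technique mentioned for tracking lexicon change can be illustrated by comparing relative frequencies across two corpus slices; the toy corpora, smoothing constant, and ratio measure are assumptions for illustration only:

    ```python
    # Rank words by how much their relative frequency rises between an older
    # and a newer corpus slice.
    from collections import Counter

    def rel_freq(tokens: list[str]) -> dict[str, float]:
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def lexicon_shift(old_tokens, new_tokens, eps=1e-6):
        old, new = rel_freq(old_tokens), rel_freq(new_tokens)
        vocab = set(old) | set(new)
        # Ratio > 1 means the word is gaining ground in the newer slice.
        return sorted(vocab, reverse=True,
                      key=lambda w: (new.get(w, 0) + eps) / (old.get(w, 0) + eps))

    old = "the letter arrived by post the post was slow".split()
    new = "the message arrived by email the email was fast".split()
    print(lexicon_shift(old, new)[:2])  # rising words, e.g. ['email', ...]
    ```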
  5. Salton, G.: Automatic processing of foreign language documents (1985) 0.00
    0.0043693185 = product of:
      0.02621591 = sum of:
        0.02621591 = weight(_text_:computer in 3650) [ClassicSimilarity], result of:
          0.02621591 = score(doc=3650,freq=2.0), product of:
            0.16231956 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.044416238 = queryNorm
            0.16150802 = fieldWeight in 3650, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.03125 = fieldNorm(doc=3650)
      0.16666667 = coord(1/6)
    
    Abstract
    The attempt to computerize a process, such as indexing, abstracting, classifying, or retrieving information, begins with an analysis of the process into its intellectual and nonintellectual components. That part of the process which is amenable to computerization is mechanical or algorithmic. What is not is intellectual or creative and requires human intervention. Gerard Salton has been an innovator, experimenter, and promoter in the area of mechanized information systems since the early 1960s. He has been particularly ingenious at analyzing the process of information retrieval into its algorithmic components. He received a doctorate in applied mathematics from Harvard University before moving to the computer science department at Cornell, where he developed a prototype automatic retrieval system called SMART. Working with this system, he and his students contributed for over a decade to our theoretical understanding of the retrieval process. On a more practical level, they have contributed design criteria for operating retrieval systems. The following selection presents one of the early descriptions of the SMART system; it is valuable as it shows the direction automatic retrieval methods were to take beyond simple word-matching techniques. These include various word normalization techniques to improve recall, for instance, the separation of words into stems and affixes; the correlation and clustering, using statistical association measures, of related terms; and the identification, using a concept thesaurus, of synonymous, broader, narrower, and sibling terms. They include, as well, techniques, both linguistic and statistical, to deal with the thorny problem of how to automatically extract from texts index terms that consist of more than one word. They include weighting techniques and various document-request matching algorithms. Significant among the latter are those which produce a retrieval output of citations ranked in relevance order. During the 1970s, Salton and his students went on to further refine these various techniques, particularly the weighting and statistical association measures. Many of their early innovations seem commonplace today. Some of their later techniques are still ahead of their time and await technological developments for implementation. The particular focus of the selection that follows is on the evaluation of a particular component of the SMART system, a multilingual thesaurus. By mapping English language expressions and their German equivalents to a common concept number, the thesaurus permitted the automatic processing of German language documents against English language queries and vice versa. The results of the evaluation, as it turned out, were somewhat inconclusive. However, this SMART experiment suggested in a bold and optimistic way how one might proceed to answer such complex questions as: What is meant by retrieval language compatibility? How is it to be achieved, and how evaluated?
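    The multilingual-thesaurus mechanism described at the end (English expressions and their German equivalents mapped to a common concept number) can be sketched as follows; the tiny vocabulary and the overlap measure are illustrative assumptions, not SMART's actual thesaurus or weighting scheme:

    ```python
    # English and German surface forms share concept numbers, so a German
    # document can be scored against an English query (and vice versa).
    CONCEPTS = {
        "information": 101, "Information": 101,
        "retrieval": 102, "Wiederauffindung": 102,
        "library": 103, "Bibliothek": 103,
    }

    def to_concepts(text: str) -> set[int]:
        return {CONCEPTS[w] for w in text.split() if w in CONCEPTS}

    def match(query: str, document: str) -> float:
        q, d = to_concepts(query), to_concepts(document)
        return len(q & d) / len(q) if q else 0.0

    print(match("information retrieval",
                "Bibliothek für Information und Wiederauffindung"))  # 1.0
    ```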
  6. Riloff, E.: ¬An empirical study of automated dictionary construction for information extraction in three domains (1996) 0.00
    0.0040118583 = product of:
      0.024071148 = sum of:
        0.024071148 = product of:
          0.048142295 = sum of:
            0.048142295 = weight(_text_:22 in 6752) [ClassicSimilarity], result of:
              0.048142295 = score(doc=6752,freq=2.0), product of:
                0.1555381 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.044416238 = queryNorm
                0.30952093 = fieldWeight in 6752, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=6752)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Date
    6. 3.1997 16:22:15
  7. SIGIR'92 : Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1992) 0.00
    0.0038231534 = product of:
      0.02293892 = sum of:
        0.02293892 = weight(_text_:computer in 6671) [ClassicSimilarity], result of:
          0.02293892 = score(doc=6671,freq=2.0), product of:
            0.16231956 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.044416238 = queryNorm
            0.14131951 = fieldWeight in 6671, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.02734375 = fieldNorm(doc=6671)
      0.16666667 = coord(1/6)
    
    Abstract
    The conference was organized by the Royal School of Librarianship in Copenhagen and was held in cooperation with AICA-GLIR (Italy), BCS-IRSG (UK), DD (Denmark), GI (Germany), and INRIA (France). It had support from Apple Computer, Denmark. The volume contains the 32 papers and reports on the two panel sessions, moderated by W.B. Croft and R. Kovetz, respectively.
  8. Needham, R.M.; Sparck Jones, K.: Keywords and clumps (1985) 0.00
    0.0038231534 = product of:
      0.02293892 = sum of:
        0.02293892 = weight(_text_:computer in 3645) [ClassicSimilarity], result of:
          0.02293892 = score(doc=3645,freq=2.0), product of:
            0.16231956 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.044416238 = queryNorm
            0.14131951 = fieldWeight in 3645, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3645)
      0.16666667 = coord(1/6)
    
    Abstract
    The selection that follows was chosen as it represents "a very early paper on the possibilities allowed by computers in documentation." In the early 1960s computers were being used to provide simple automatic indexing systems wherein keywords were extracted from documents. The problem with such systems was that they lacked vocabulary control; thus documents related in subject matter were not always collocated in retrieval. To improve retrieval by improving recall is the raison d'être of vocabulary control tools such as classifications and thesauri. The question arose whether it was possible by automatic means to construct classes of terms which, when substituted one for another, could be used to improve retrieval performance. One of the first theoretical approaches to this question was initiated by R.M. Needham and Karen Sparck Jones at the Cambridge Language Research Institute in England. The question was later pursued using experimental methodologies by Sparck Jones, who, as a Senior Research Associate in the Computer Laboratory at the University of Cambridge, has devoted her life's work to research in information retrieval and automatic natural language processing. Based on the principles of numerical taxonomy, automatic classification techniques start from the premise that two objects are similar to the degree that they share attributes in common. When these two objects are keywords, their similarity is measured in terms of the number of documents they index in common. Step 1 in automatic classification is to compute mathematically the degree to which two terms are similar. Step 2 is to group together those terms that are "most similar" to each other, forming equivalence classes of intersubstitutable terms. The technique for forming such classes varies and is the factor that characteristically distinguishes different approaches to automatic classification. The technique used by Needham and Sparck Jones, that of clumping, is described in the selection that follows. Questions that must be asked are whether the use of automatically generated classes really does improve retrieval performance and whether there is a true economic advantage in substituting mechanical for manual labor. Several years after her work with clumping, Sparck Jones was to observe that while it was not wholly satisfactory in itself, it was valuable in that it stimulated research into automatic classification. To this it might be added that it was valuable in that it introduced to library/information science the methods of numerical taxonomy, thus stimulating us to think again about the fundamental nature and purpose of classification. In this connection it might be useful to review how automatically derived classes differ from those of manually constructed classifications: 1) the manner of their derivation is purely a posteriori, the ultimate operationalization of the principle of literary warrant; 2) the relationship between members forming such classes is essentially statistical; the members of a given class are similar to each other not because they possess the class-defining characteristic but by virtue of sharing a family resemblance; and finally, 3) automatically derived classes are not related meaningfully one to another, that is, they are not ordered in traditional hierarchical and precedence relationships.
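    Steps 1 and 2 as laid out in the abstract can be sketched directly: measure term-term similarity by the documents two keywords index in common, then group the most similar terms. The Jaccard coefficient and threshold grouping below are simple stand-ins, not the clumping procedure itself:

    ```python
    # Step 1: similarity from co-indexed documents; Step 2: greedy grouping.
    from itertools import combinations

    # Inverted index: keyword -> set of document ids it indexes (toy data).
    INDEX = {
        "classification": {1, 2, 3}, "taxonomy": {2, 3, 4},
        "retrieval": {5, 6}, "search": {5, 6, 7},
    }

    def similarity(a: str, b: str) -> float:
        da, db = INDEX[a], INDEX[b]
        return len(da & db) / len(da | db)  # Jaccard overlap (an assumption)

    def clumps(threshold: float = 0.4) -> list[set[str]]:
        groups: list[set[str]] = []
        for a, b in combinations(INDEX, 2):
            if similarity(a, b) >= threshold:
                for g in groups:
                    if a in g or b in g:   # extend an existing class
                        g.update({a, b})
                        break
                else:
                    groups.append({a, b})  # start a new class
        return groups

    # Two classes result: {classification, taxonomy} and {retrieval, search}.
    print(clumps())
    ```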
  9. Lorenz, S.: Konzeption und prototypische Realisierung einer begriffsbasierten Texterschließung (2006) 0.00
    0.0030088935 = product of:
      0.01805336 = sum of:
        0.01805336 = product of:
          0.03610672 = sum of:
            0.03610672 = weight(_text_:22 in 1746) [ClassicSimilarity], result of:
              0.03610672 = score(doc=1746,freq=2.0), product of:
                0.1555381 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.044416238 = queryNorm
                0.23214069 = fieldWeight in 1746, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1746)
          0.5 = coord(1/2)
      0.16666667 = coord(1/6)
    
    Date
    22. 3.2015 9:17:30