Search (5 results, page 1 of 1)

  • theme_ss:"Computerlinguistik"
  • type_ss:"el"
  • year_i:[1990 TO 2000}
  1. Sebastiani, F.: A tutorial on automated text categorisation (1999)
    
    Abstract
    The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late '80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by "learning", from a set of previously classified documents, the characteristics of one or more categories; the advantages are very good effectiveness, considerable savings in expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation will be touched upon.
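
    The inductive process described above can be made concrete with a minimal sketch. The Python fragment below is an illustrative multinomial Naive-Bayes categoriser, one member of the family of learning methods such a tutorial covers; the categories and training documents are invented for the example, and a real system would add proper tokenisation and feature selection.

      from collections import Counter, defaultdict
      import math

      def train(docs):
          """Learn per-category term statistics from (text, category) pairs."""
          word_counts = defaultdict(Counter)  # category -> term frequencies
          cat_counts = Counter()              # category -> number of documents
          for text, cat in docs:
              cat_counts[cat] += 1
              word_counts[cat].update(text.lower().split())
          return word_counts, cat_counts

      def classify(text, word_counts, cat_counts):
          """Return the category with the highest smoothed log-probability."""
          vocab = {w for c in word_counts for w in word_counts[c]}
          total_docs = sum(cat_counts.values())
          best, best_score = None, float("-inf")
          for cat in cat_counts:
              score = math.log(cat_counts[cat] / total_docs)  # category prior
              n = sum(word_counts[cat].values())
              for w in text.lower().split():
                  # Laplace smoothing keeps unseen terms from zeroing the score.
                  score += math.log((word_counts[cat][w] + 1) / (n + len(vocab)))
              if score > best_score:
                  best, best_score = cat, score
          return best

      # Invented training set: the classifier induces category characteristics
      # from previously classified documents, as the abstract describes.
      training = [
          ("stocks fell sharply on the exchange", "finance"),
          ("the bank raised interest rates", "finance"),
          ("the team won the championship game", "sports"),
          ("a record sprint at the olympic games", "sports"),
      ]
      wc, cc = train(training)
      print(classify("interest rates on the exchange", wc, cc))  # -> finance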
  2. Schmid, H.: Improvements in Part-of-Speech tagging with an application to German (1995)
    
    Abstract
    This paper presents a couple of extensions to a basic Markov Model tagger (called TreeTagger) which improve its accuracy when trained on small corpora. The basic tagger was originally developed for English (Schmid, 1994). The extensions together reduced error rates on a German test corpus by more than a third.
    Type
    a
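
    For illustration, the kind of basic Markov-model tagger that TreeTagger builds on can be sketched as Viterbi decoding over transition and emission probabilities. All tags and probabilities below are invented toy values; the sketch reproduces neither TreeTagger's decision-tree estimation of transition probabilities nor the paper's extensions for small training corpora.

      import math

      def viterbi(words, tags, trans, emit, start):
          """Most probable tag sequence under a bigram Markov-model tagger.
          trans[(t1, t2)] ~ P(t2|t1), emit[(t, w)] ~ P(w|t), start[t] ~ P(t first).
          Missing events get a tiny floor probability instead of real smoothing."""
          V = [{t: (math.log(start.get(t, 1e-12)) +
                    math.log(emit.get((t, words[0]), 1e-12)), [t]) for t in tags}]
          for w in words[1:]:
              row = {}
              for t in tags:
                  prev = max(V[-1], key=lambda p: V[-1][p][0] +
                             math.log(trans.get((p, t), 1e-12)))
                  score = (V[-1][prev][0] +
                           math.log(trans.get((prev, t), 1e-12)) +
                           math.log(emit.get((t, w), 1e-12)))
                  row[t] = (score, V[-1][prev][1] + [t])
              V.append(row)
          return max(V[-1].values())[1]

      # Invented miniature model over three tags and a four-word sentence.
      tags = ["DET", "NOUN", "VERB"]
      start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
      trans = {("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.8, ("VERB", "DET"): 0.7}
      emit = {("DET", "the"): 0.6, ("NOUN", "dog"): 0.3,
              ("VERB", "chased"): 0.4, ("NOUN", "cat"): 0.3}
      print(viterbi("the dog chased the".split(), tags, trans, emit, start))
      # -> ['DET', 'NOUN', 'VERB', 'DET']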
  3. Dunning, T.: Statistical identification of language (1994)
    
    Abstract
    A statistically based program has been written which learns to distinguish between languages. The amount of training text that such a program needs is surprisingly small, and the amount of text needed to make an identification is also quite small. The program incorporates no linguistic presuppositions other than the assumption that text can be encoded as a string of bytes. Such a program can be used to determine which language small bits of text are in. It also shows a potential for what might be called 'statistical philology' in that it may be applied directly to phonetic transcriptions to help elucidate family trees among language dialects. A variant of this program has been shown to be useful as a quality control in biochemistry. In this application, genetic sequences are assumed to be expressions in a language peculiar to the organism from which the sequence is taken. Thus language identification becomes species identification.
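
    The approach lends itself to a compact sketch: train a smoothed character-bigram model per language and pick the model under which a sample is most probable. The training strings below are invented and far smaller than even the modest amounts of text the paper reports needing; 256*256 stands in for the byte-pair alphabet the abstract assumes.

      import math
      from collections import Counter

      def bigram_model(text):
          """Character-bigram counts from a training string."""
          pairs = Counter(zip(text, text[1:]))
          return pairs, sum(pairs.values())

      def log_likelihood(text, model):
          """Log-probability of text under an add-one smoothed bigram model."""
          pairs, total = model
          return sum(math.log((pairs[(a, b)] + 1) / (total + 256 * 256))
                     for a, b in zip(text, text[1:]))

      # Tiny invented training samples; the paper's point is that surprisingly
      # little training text is needed for reliable identification.
      models = {
          "english": bigram_model("the quick brown fox jumps over the lazy dog"),
          "german": bigram_model("der schnelle braune fuchs springt ueber den faulen hund"),
      }
      sample = "the fox and the dog"
      print(max(models, key=lambda lang: log_likelihood(sample, models[lang])))
      # -> english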
  4. Chowdhury, A.; McCabe, M.C.: Improving information retrieval systems using part of speech tagging (1993)
    
    Abstract
    The object of Information Retrieval is to retrieve all relevant documents for a user query and only those relevant documents. Much research has focused on achieving this objective with little regard for storage overhead or performance. In this paper we evaluate the use of Part of Speech Tagging to improve the index storage overhead and general speed of the system with only a minimal reduction in precision/recall measurements. We tagged 500 MB of the Los Angeles Times 1990 and 1989 document collection provided by TREC for parts of speech. We then experimented to find the most relevant parts of speech to index. We show that 90% of precision/recall is achieved with 40% of the document collection's terms. We also show that this is an improvement in overhead with only a 1% reduction in precision/recall.
    Type
    a
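
    The indexing strategy can be sketched as a part-of-speech filter applied while building an inverted index. The lookup-table 'tagger' and the two documents below are invented stand-ins for a real tagger and collection; the 90%/40% figures above come from the paper's TREC experiments, not from this sketch.

      from collections import defaultdict

      # Hypothetical lexicon standing in for a real part-of-speech tagger.
      POS = {"dog": "NOUN", "cat": "NOUN", "retrieval": "NOUN", "index": "NOUN",
             "the": "DET", "a": "DET", "chased": "VERB", "improves": "VERB",
             "quickly": "ADV"}

      def build_index(docs, keep):
          """Inverted index storing only terms whose POS tag is in `keep`."""
          index = defaultdict(set)
          for doc_id, text in docs.items():
              for term in text.lower().split():
                  if POS.get(term) in keep:
                      index[term].add(doc_id)
          return index

      docs = {1: "the dog chased a cat", 2: "retrieval quickly improves the index"}
      full = build_index(docs, keep=set(POS.values()))   # index every term
      nouns = build_index(docs, keep={"NOUN"})           # nouns only, one variant
      print(len(nouns), "of", len(full), "terms kept")   # -> 4 of 9 terms kept
      print(sorted(nouns["index"]))                      # -> [2]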
  5. Rindflesch, T.C.; Aronson, A.R.: Semantic processing in information retrieval (1993)
    
    Abstract
    Intuition suggests that one way to enhance the information retrieval process would be the use of phrases to characterize the contents of text. A number of researchers, however, have noted that phrases alone do not improve retrieval effectiveness. In this paper we briefly review the use of phrases in information retrieval and then suggest extensions to this paradigm using semantic information. We claim that semantic processing, which can be viewed as expressing relations between the concepts represented by phrases, will in fact enhance retrieval effectiveness. The availability of the UMLS® domain model, which we exploit extensively, significantly contributes to the feasibility of this processing.
    Type
    a
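
    As a rough illustration of the step from phrases to relations, the sketch below maps phrases to semantic types through a tiny invented dictionary standing in for the UMLS domain model, then asserts a relation licensed by a single hand-written rule. Real UMLS-based semantic processing is of course far richer than this.

      # Hypothetical mini domain model: phrase -> semantic type, plus one
      # relation rule over semantic types. Invented stand-in for UMLS.
      CONCEPTS = {
          "aspirin": "Pharmacologic Substance",
          "headache": "Sign or Symptom",
          "liver": "Body Part",
      }
      RELATIONS = {("Pharmacologic Substance", "Sign or Symptom"): "TREATS"}

      def extract_relations(phrases):
          """Map phrases to concepts, then relate concept pairs the model licenses."""
          typed = [(p, CONCEPTS[p]) for p in phrases if p in CONCEPTS]
          found = []
          for p1, t1 in typed:
              for p2, t2 in typed:
                  rel = RELATIONS.get((t1, t2))
                  if rel:
                      found.append((p1, rel, p2))
          return found

      # Phrases alone would index 'aspirin' and 'headache' independently;
      # the semantic step adds the relation between the concepts they denote.
      print(extract_relations(["aspirin", "headache"]))
      # -> [('aspirin', 'TREATS', 'headache')]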