Search (4 results, page 1 of 1)

  • × theme_ss:"Automatisches Klassifizieren"
  • × theme_ss:"Computerlinguistik"
  • × type_ss:"a"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.37
    0.36854386 = product of:
      0.4913918 = sum of:
        0.04713235 = product of:
          0.14139704 = sum of:
            0.14139704 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.14139704 = score(doc=562,freq=2.0), product of:
                0.25158808 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.029675366 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
        0.020951848 = weight(_text_:web in 562) [ClassicSimilarity], result of:
          0.020951848 = score(doc=562,freq=2.0), product of:
            0.096845865 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.029675366 = queryNorm
            0.21634221 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.14139704 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14139704 = score(doc=562,freq=2.0), product of:
            0.25158808 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.029675366 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.14139704 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14139704 = score(doc=562,freq=2.0), product of:
            0.25158808 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.029675366 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.027816659 = weight(_text_:data in 562) [ClassicSimilarity], result of:
          0.027816659 = score(doc=562,freq=4.0), product of:
            0.093835 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.029675366 = queryNorm
            0.29644224 = fieldWeight in 562, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.11269685 = sum of:
          0.08857323 = weight(_text_:mining in 562) [ClassicSimilarity], result of:
            0.08857323 = score(doc=562,freq=4.0), product of:
              0.16744171 = queryWeight, product of:
                5.642448 = idf(docFreq=425, maxDocs=44218)
                0.029675366 = queryNorm
              0.5289795 = fieldWeight in 562, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.642448 = idf(docFreq=425, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
          0.024123615 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
            0.024123615 = score(doc=562,freq=2.0), product of:
              0.103918076 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.029675366 = queryNorm
              0.23214069 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
      0.75 = coord(6/8)
    
    Content
    Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8.1.2013 10:22:32
    Source
    Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK
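    The relevance figure 0.37 above is standard Lucene ClassicSimilarity (TF-IDF) explain output. As a reading aid, the following minimal Python sketch reproduces the arithmetic for one clause of document 562 from the values printed above; queryNorm and fieldNorm are taken as given, and the formula is the stock ClassicSimilarity one, not anything specific to this database.

        import math

        # Values copied from the explain output for doc 562, clause "_text_:3a".
        doc_freq   = 24           # docFreq in the idf line
        max_docs   = 44218        # maxDocs in the idf line
        freq       = 2.0          # termFreq within the field
        query_norm = 0.029675366  # taken as given
        field_norm = 0.046875     # taken as given

        idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # ~8.478011
        tf  = math.sqrt(freq)                            # ~1.4142135

        query_weight = idf * query_norm                  # ~0.25158808
        field_weight = tf * idf * field_norm             # ~0.56201804
        clause_score = query_weight * field_weight       # ~0.14139704

        # This clause is scaled by coord(1/3) inside its sub-sum; the outer sum of
        # all six matching clauses (0.4913918) is scaled by coord(6/8) = 0.75,
        # giving the document score 0.36854386 shown above.
        print(clause_score, clause_score / 3)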
  2. Ko, Y.: A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.00
    0.003914421 = product of:
      0.031315368 = sum of:
        0.031315368 = product of:
          0.062630735 = sum of:
            0.062630735 = weight(_text_:mining in 2339) [ClassicSimilarity], result of:
              0.062630735 = score(doc=2339,freq=2.0), product of:
                0.16744171 = queryWeight, product of:
                  5.642448 = idf(docFreq=425, maxDocs=44218)
                  0.029675366 = queryNorm
                0.37404498 = fieldWeight in 2339, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.642448 = idf(docFreq=425, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2339)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Abstract
    Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.
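    The abstract does not give the exact log tf-TRR formula, so the sketch below only contrasts the traditional tf-idf weight it mentions with an illustrative class-aware weight in which the idf factor is replaced by a smoothed ratio of positive- to negative-class document frequencies; the second function is a hypothetical stand-in for the idea, not the paper's scheme.

        import math

        def tfidf_weight(freq, doc_freq, n_docs):
            # Traditional tf-idf, carried over unchanged from IR into many TC systems.
            tf = math.log(1.0 + freq)
            idf = 1.0 + math.log(n_docs / (doc_freq + 1.0))
            return tf * idf

        def class_ratio_weight(freq, pos_df, neg_df):
            # Illustrative class-aware weight (NOT the paper's log tf-TRR formula):
            # the collection-wide idf is replaced by a smoothed log ratio of the
            # term's document frequency in positive vs. negative training documents.
            tf = math.log(1.0 + freq)
            ratio = (pos_df + 1.0) / (neg_df + 1.0)
            return tf * math.log(1.0 + ratio)

        # Hypothetical toy comparison for a term seen mostly in the positive class.
        print(tfidf_weight(3, doc_freq=50, n_docs=10000))
        print(class_ratio_weight(3, pos_df=40, neg_df=10))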
  3. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.00
    0.003086499 = product of:
      0.024691992 = sum of:
        0.024691992 = weight(_text_:web in 831) [ClassicSimilarity], result of:
          0.024691992 = score(doc=831,freq=4.0), product of:
            0.096845865 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.029675366 = queryNorm
            0.25496176 = fieldWeight in 831, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
      0.125 = coord(1/8)
    
    Abstract
    Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings - There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.
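    A minimal sketch of the character-level language-modeling idea the abstract advocates: train one character bigram model per class on unsegmented text and assign a new text to the class whose model gives it the highest log-likelihood. Add-one smoothing, the bigram order, and the toy training data are simplifications for illustration, not the paper's setup.

        import math
        from collections import Counter

        class CharBigramLM:
            # Character bigram language model with add-one smoothing; working on
            # characters sidesteps word segmentation entirely.
            def __init__(self, texts):
                self.bigrams, self.unigrams = Counter(), Counter()
                for t in texts:
                    padded = "^" + t + "$"
                    self.unigrams.update(padded)
                    self.bigrams.update(zip(padded, padded[1:]))
                self.vocab = max(len(self.unigrams), 1)

            def log_prob(self, text):
                padded = "^" + text + "$"
                return sum(math.log((self.bigrams[(a, b)] + 1.0) /
                                    (self.unigrams[a] + self.vocab))
                           for a, b in zip(padded, padded[1:]))

        def classify(text, models):
            # Pick the class whose model assigns the highest per-character log-likelihood.
            return max(models, key=lambda c: models[c].log_prob(text) / max(len(text), 1))

        # Hypothetical toy usage: one model per class, trained on labelled raw text.
        train = {"sports": ["the match ended in a draw"],
                 "finance": ["the market closed lower"]}
        models = {c: CharBigramLM(texts) for c, texts in train.items()}
        print(classify("the market rallied", models))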
  4. Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.00
    0.0028975685 = product of:
      0.023180548 = sum of:
        0.023180548 = weight(_text_:data in 1853) [ClassicSimilarity], result of:
          0.023180548 = score(doc=1853,freq=4.0), product of:
            0.093835 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.029675366 = queryNorm
            0.24703519 = fieldWeight in 1853, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
      0.125 = coord(1/8)
    
    Abstract
    In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics (bibliometrics and scientometrics studies) for STW rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) are already attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has rendered necessary the integration of natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted di-graphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), will seek to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.
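    The abstract describes the data structure (terms as nodes, syntactic variation relations as weighted directed edges) but not the CPCL algorithm itself, so the sketch below only builds such a digraph and groups variants by connected components as a stand-in for the clustering step; the edge types, weights, and grouping rule are illustrative, not CPCL.

        from collections import defaultdict

        # Terms are nodes; each weighted, typed edge records a syntactic variation
        # relation between two term variants. The example relations are illustrative.
        edges = [
            ("text classification", "automatic text classification", "expansion", 2.0),
            ("text classification", "text categorization", "substitution", 1.0),
            ("term weighting", "term weighting scheme", "expansion", 2.0),
        ]

        graph = defaultdict(dict)
        for src, dst, relation, weight in edges:
            graph[src][dst] = (relation, weight)

        def variant_classes(graph):
            # Stand-in for CPCL: group term variants by connected components of the
            # undirected view of the variation graph. CPCL additionally exploits the
            # relation types and edge weights when forming classes.
            neighbours = defaultdict(set)
            for src, targets in graph.items():
                for dst in targets:
                    neighbours[src].add(dst)
                    neighbours[dst].add(src)
            seen, classes = set(), []
            for node in neighbours:
                if node in seen:
                    continue
                stack, component = [node], set()
                while stack:
                    n = stack.pop()
                    if n not in seen:
                        seen.add(n)
                        component.add(n)
                        stack.extend(neighbours[n] - seen)
                classes.append(component)
            return classes

        print(variant_classes(graph))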