Search (255 results, page 1 of 13)

  • theme_ss:"Computerlinguistik"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.14
    0.14434984 = sum of:
      0.08056292 = product of:
        0.24168874 = sum of:
          0.24168874 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
            0.24168874 = score(doc=562,freq=2.0), product of:
              0.43003735 = queryWeight, product of:
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.050723847 = queryNorm
              0.56201804 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.33333334 = coord(1/3)
      0.04316979 = weight(_text_:based in 562) [ClassicSimilarity], result of:
        0.04316979 = score(doc=562,freq=4.0), product of:
          0.15283063 = queryWeight, product of:
            3.0129938 = idf(docFreq=5906, maxDocs=44218)
            0.050723847 = queryNorm
          0.28246817 = fieldWeight in 562, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            3.0129938 = idf(docFreq=5906, maxDocs=44218)
            0.046875 = fieldNorm(doc=562)
      0.020617142 = product of:
        0.041234285 = sum of:
          0.041234285 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
            0.041234285 = score(doc=562,freq=2.0), product of:
              0.17762627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.050723847 = queryNorm
              0.23214069 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.5 = coord(1/2)
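
    How to read these score breakdowns: under Lucene's ClassicSimilarity, each matching term clause contributes queryWeight × fieldWeight, and every number in the tree can be reproduced by hand. A worked check against the `_text_:3a` clause above, using only the freq, docFreq, maxDocs, queryNorm, and fieldNorm values it reports (standard ClassicSimilarity formulas):

        idf         = 1 + ln(maxDocs / (docFreq + 1)) = 1 + ln(44218 / 25)    ≈ 8.478011
        tf          = sqrt(freq) = sqrt(2)                                    ≈ 1.4142135
        queryWeight = idf × queryNorm = 8.478011 × 0.050723847                ≈ 0.43003735
        fieldWeight = tf × idf × fieldNorm = 1.4142135 × 8.478011 × 0.046875  ≈ 0.56201804
        score       = queryWeight × fieldWeight                               ≈ 0.24168874

    The coord(1/3) factor then scales this clause by 1/3 to 0.08056292, the first summand of the document score.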
    
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well-known text corpora support our approach through consistent improvement of the results.
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf
    Date
    8. 1.2013 10:22:32
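
    A minimal sketch of the representation this entry proposes, assuming scikit-learn: the classical Bag-Of-Words matrix is extended with columns counting concept occurrences, and a boosting ensemble of weak learners (decision stumps) is trained on the combined features. The concept lexicon, documents, and labels are invented for illustration; Hotho et al. derive concepts from background knowledge rather than a hand-made map.

        import numpy as np
        from scipy.sparse import csr_matrix, hstack
        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.feature_extraction.text import CountVectorizer

        # Hypothetical background knowledge: word -> concept.
        CONCEPTS = {"soccer": "sport", "goalkeeper": "sport",
                    "loan": "finance", "bank": "finance"}

        def concept_features(docs):
            """One column per concept, counting occurrences of its words."""
            names = sorted(set(CONCEPTS.values()))
            col = {c: i for i, c in enumerate(names)}
            m = np.zeros((len(docs), len(names)))
            for i, doc in enumerate(docs):
                for tok in doc.lower().split():
                    if tok in CONCEPTS:
                        m[i, col[CONCEPTS[tok]]] += 1
            return csr_matrix(m)

        docs = ["the goalkeeper saved the soccer match",
                "the bank approved a new loan",
                "soccer fans cheered the goalkeeper",
                "the loan rate set by the bank rose"]
        labels = [0, 1, 0, 1]

        bow = CountVectorizer().fit_transform(docs)   # terms ...
        X = hstack([bow, concept_features(docs)])     # ... plus concepts
        clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)  # boosted stumps
        print(clf.predict(X))
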
  2. Somers, H.: Example-based machine translation : Review article (1999) 0.08
    0.07955547 = product of:
      0.11933319 = sum of:
        0.07122652 = weight(_text_:based in 6672) [ClassicSimilarity], result of:
          0.07122652 = score(doc=6672,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.46604872 = fieldWeight in 6672, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.109375 = fieldNorm(doc=6672)
        0.048106667 = product of:
          0.09621333 = sum of:
            0.09621333 = weight(_text_:22 in 6672) [ClassicSimilarity], result of:
              0.09621333 = score(doc=6672,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.5416616 = fieldWeight in 6672, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6672)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Date
    31. 7.1996 9:22:19
  3. Chandrasekar, R.; Srinivas, B.: Automatic induction of rules for text simplification (1997) 0.08
    0.07600854 = product of:
      0.11401281 = sum of:
        0.07122652 = weight(_text_:based in 2873) [ClassicSimilarity], result of:
          0.07122652 = score(doc=2873,freq=8.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.46604872 = fieldWeight in 2873, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2873)
        0.04278629 = product of:
          0.08557258 = sum of:
            0.08557258 = weight(_text_:training in 2873) [ClassicSimilarity], result of:
              0.08557258 = score(doc=2873,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3612125 = fieldWeight in 2873, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2873)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Explores methods to automatically transform sentences in order to make them simpler. These methods involve the use of a rule-based system, driven by the syntax of the text in the domain of interest. Hand-crafting rules for every domain is time-consuming and impractical. Describes an algorithm and an implementation by which generalized rules for simplification are automatically induced from annotated training materials using a novel partial parsing technique, which combines constituent structure and dependency information. The algorithm employs example-based generalisations on linguistically motivated structures
    Footnote
    Contribution to an issue devoted to papers from the International Conference on Knowledge Based Computer systems, 16-18 Dec 1996, Mumbai, India
    Source
    Knowledge-based systems. 10(1997) no.3, S.183-190
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I.: Attention Is all you need (2017) 0.07
    0.0711273 = product of:
      0.10669095 = sum of:
        0.04316979 = weight(_text_:based in 970) [ClassicSimilarity], result of:
          0.04316979 = score(doc=970,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.28246817 = fieldWeight in 970, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=970)
        0.06352116 = product of:
          0.12704232 = sum of:
            0.12704232 = weight(_text_:training in 970) [ClassicSimilarity], result of:
              0.12704232 = score(doc=970,freq=6.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.53626144 = fieldWeight in 970, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=970)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
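
    The core operation of the Transformer named above is scaled dot-product attention. A minimal NumPy sketch of that single operation (toy shapes; the paper additionally uses multiple heads, masking, and learned projections):

        import numpy as np

        def scaled_dot_product_attention(Q, K, V):
            """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
            d_k = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
            scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
            weights = np.exp(scores)
            weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
            return weights @ V                              # weighted sum of values

        rng = np.random.default_rng(0)
        Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
        K = rng.normal(size=(6, 8))   # 6 key/value positions
        V = rng.normal(size=(6, 8))
        print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
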
  5. Brychcín, T.; Konopík, M.: HPS: High precision stemmer (2015) 0.06
    0.05927276 = product of:
      0.088909134 = sum of:
        0.035974823 = weight(_text_:based in 2686) [ClassicSimilarity], result of:
          0.035974823 = score(doc=2686,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23539014 = fieldWeight in 2686, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2686)
        0.052934308 = product of:
          0.105868615 = sum of:
            0.105868615 = weight(_text_:training in 2686) [ClassicSimilarity], result of:
              0.105868615 = score(doc=2686,freq=6.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.44688457 = fieldWeight in 2686, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2686)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
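
    A compressed sketch of the two-stage design, under stated assumptions: stage one here groups words by shared prefixes to manufacture (word, suffix) training pairs, and stage two trains a classifier to predict which suffix to strip. Simple prefix grouping and logistic regression stand in for the paper's lexical-semantic clustering and maximum entropy model, and the word list is invented.

        import os
        from collections import defaultdict
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression

        WORDS = ["connect", "connected", "connection", "printing", "printed",
                 "printer", "walked", "walking", "talking", "talked"]

        # Stage 1 (stand-in for the paper's clustering): group words by a short
        # prefix; each group's longest common prefix serves as its stem, which
        # yields (word, suffix) training pairs for stage 2.
        groups = defaultdict(list)
        for w in WORDS:
            groups[w[:4]].append(w)
        pairs = [(w, os.path.commonprefix(g)) for g in groups.values() for w in g]

        def feats(word):
            """Word-shape features for the suffix classifier."""
            return {"suf2": word[-2:], "suf3": word[-3:], "len": len(word)}

        X = [feats(w) for w, root in pairs]
        y = [w[len(root):] or "-" for w, root in pairs]   # suffix label, "-" = none

        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

        def hps_stem(word):
            """Stage 2 applied to unseen words: strip the predicted suffix."""
            suffix = clf.predict(vec.transform([feats(word)]))[0]
            return word[:-len(suffix)] if suffix != "-" and word.endswith(suffix) else word

        print(hps_stem("jumping"), hps_stem("connecting"))
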
  6. Escolano, C.; Costa-Jussà, M.R.; Fonollosa, J.A.: From bilingual to multilingual neural-based machine translation by incremental training (2021) 0.06
    0.05927276 = product of:
      0.088909134 = sum of:
        0.035974823 = weight(_text_:based in 97) [ClassicSimilarity], result of:
          0.035974823 = score(doc=97,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23539014 = fieldWeight in 97, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=97)
        0.052934308 = product of:
          0.105868615 = sum of:
            0.105868615 = weight(_text_:training in 97) [ClassicSimilarity], result of:
              0.105868615 = score(doc=97,freq=6.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.44688457 = fieldWeight in 97, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=97)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    A common intermediate language representation in neural machine translation can be used to extend bilingual systems by incremental training. We propose a new architecture based on introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we can train multiple encoders and decoders for each language, sharing among them a common intermediate representation. Translation results on the low-resource tasks (Turkish-English and Kazakh-English tasks) show a BLEU improvement of up to 2.8 points. However, results on a larger dataset (Russian-English and Kazakh-English) show BLEU losses of a similar amount. While our system provides improvements only for the low-resource tasks in terms of translation quality, our system is capable of quickly deploying new language pairs without the need to retrain the rest of the system, which may be a game changer in some situations. Specifically, what is most relevant regarding our architecture is that it is capable of: reducing the number of production systems, with respect to the number of languages, from quadratic to linear; incrementally adding a new language to the system without retraining the languages already there; and allowing for translations from the new language to all the others present in the system.
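
    A schematic of the additional training objective described above, assuming PyTorch; the toy encoder/decoder modules, the mean-pooling, and the MSE distance are illustrative stand-ins, not the paper's exact formulation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        torch.manual_seed(0)
        VOCAB, DIM = 100, 16

        class ToyEncoder(nn.Module):          # stand-in for a real language encoder
            def __init__(self):
                super().__init__()
                self.emb = nn.Embedding(VOCAB, DIM)
            def forward(self, ids):
                return self.emb(ids)          # (batch, seq, dim)

        class ToyDecoder(nn.Module):          # stand-in decoder: project to vocabulary
            def __init__(self):
                super().__init__()
                self.out = nn.Linear(DIM, VOCAB)
            def forward(self, h):
                return self.out(h)            # (batch, seq, vocab)

        enc_src, enc_tgt, dec_tgt = ToyEncoder(), ToyEncoder(), ToyDecoder()
        src = torch.randint(0, VOCAB, (2, 7))     # toy source-language batch
        tgt = torch.randint(0, VOCAB, (2, 7))     # toy target-language references

        h_src, h_tgt = enc_src(src), enc_tgt(tgt)
        translation = F.cross_entropy(dec_tgt(h_src).transpose(1, 2), tgt)
        # Interlingual term: pull both encoders toward a shared representation,
        # so new encoders/decoders can later be added without retraining the rest.
        interlingual = F.mse_loss(h_src.mean(dim=1), h_tgt.mean(dim=1))
        loss = translation + 1.0 * interlingual   # weight of the extra term: 1.0
        loss.backward()
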
  7. Xiang, R.; Chersoni, E.; Lu, Q.; Huang, C.-R.; Li, W.; Long, Y.: Lexical data augmentation for sentiment analysis (2021) 0.06
    0.05829522 = product of:
      0.08744283 = sum of:
        0.056881193 = weight(_text_:based in 392) [ClassicSimilarity], result of:
          0.056881193 = score(doc=392,freq=10.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.37218451 = fieldWeight in 392, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=392)
        0.030561633 = product of:
          0.061123267 = sum of:
            0.061123267 = weight(_text_:training in 392) [ClassicSimilarity], result of:
              0.061123267 = score(doc=392,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2580089 = fieldWeight in 392, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=392)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Machine learning methods, especially deep learning models, have achieved impressive performance in various natural language processing tasks including sentiment analysis. However, deep learning models are more demanding for training data. Data augmentation techniques are widely used to generate new instances based on modifications to existing data or relying on external knowledge bases to address annotated data scarcity, which hinders the full potential of machine learning techniques. This paper presents our work using part-of-speech (POS) focused lexical substitution for data augmentation (PLSDA) to enhance the performance of machine learning algorithms in sentiment analysis. We exploit POS information to identify words to be replaced and investigate different augmentation strategies to find semantically related substitutions when generating new instances. The choice of POS tags as well as a variety of strategies such as semantic-based substitution methods and sampling methods are discussed in detail. Performance evaluation focuses on the comparison between PLSDA and two previous lexical substitution-based data augmentation methods, one of which is thesaurus-based, and the other is lexicon manipulation based. Our approach is tested on five English sentiment analysis benchmarks: SST-2, MR, IMDB, Twitter, and AirRecord. Hyperparameters such as the candidate similarity threshold and number of newly generated instances are optimized. Results show that six classifiers (SVM, LSTM, BiLSTM-AT, bidirectional encoder representations from transformers [BERT], XLNet, and RoBERTa) trained with PLSDA achieve accuracy improvement of more than 0.6% compared with two previous lexical substitution methods, averaged over five benchmarks. Introducing POS constraint and well-designed augmentation strategies can improve the reliability of lexical data augmentation methods. Consequently, PLSDA significantly improves the performance of sentiment analysis algorithms.
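
    A toy sketch of the POS-focused substitution step, with an invented POS lookup and synonym table; PLSDA itself uses real taggers, richer semantic resources, and tuned sampling strategies.

        import random

        random.seed(0)

        # Invented resources for illustration only.
        POS = {"the": "DET", "movie": "NOUN", "was": "VERB", "great": "ADJ"}
        SYNONYMS = {"great": ["excellent", "wonderful"], "movie": ["film"]}
        TARGET_POS = {"ADJ", "NOUN"}          # substitute only content-word classes

        def plsda_style_augment(sentence, n=3):
            """Generate label-preserving variants by replacing one eligible word."""
            tokens = sentence.split()
            slots = [i for i, t in enumerate(tokens)
                     if POS.get(t) in TARGET_POS and t in SYNONYMS]
            variants = []
            for _ in range(n):
                i = random.choice(slots)
                copy = tokens[:]
                copy[i] = random.choice(SYNONYMS[copy[i]])
                variants.append(" ".join(copy))
            return variants

        for s in plsda_style_augment("the movie was great"):
            print(s)      # each variant keeps the original sentiment label
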
  8. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.05
    0.05226637 = product of:
      0.078399554 = sum of:
        0.03561326 = weight(_text_:based in 1595) [ClassicSimilarity], result of:
          0.03561326 = score(doc=1595,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23302436 = fieldWeight in 1595, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1595)
        0.04278629 = product of:
          0.08557258 = sum of:
            0.08557258 = weight(_text_:training in 1595) [ClassicSimilarity], result of:
              0.08557258 = score(doc=1595,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3612125 = fieldWeight in 1595, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1595)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    This paper presents a method that exploits the hierarchical structure of an indexing vocabulary to guide the development and training of machine learning methods for automatic text categorization. We present the design of a hierarchical classifier based on the divide-and-conquer principle. The method is evaluated using backpropagation neural networks as the machine learning algorithm, which learn to assign MeSH categories to a subset of MEDLINE records. Comparisons with the traditional Rocchio algorithm adapted for text categorization, as well as with flat neural network classifiers, are provided. The results indicate that the use of hierarchical structures improves performance significantly.
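
    A compact sketch of the divide-and-conquer routing, with a local classifier at each node of the hierarchy; scikit-learn logistic regression stands in for the paper's backpropagation networks, and the two-level hierarchy and training texts are invented.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Invented two-level hierarchy and tiny training set.
        TREE = {"root": ["anatomy", "diseases"],
                "anatomy": ["heart", "lung"],
                "diseases": ["infection", "cancer"]}
        TRAIN = {"heart": ["cardiac muscle and valves"],
                 "lung": ["pulmonary airways and alveoli"],
                 "infection": ["bacterial infection causing fever"],
                 "cancer": ["malignant tumor cell growth"]}

        VEC = TfidfVectorizer().fit([t for texts in TRAIN.values() for t in texts])

        def leaves(node):
            if node not in TREE:
                return [node]
            return [leaf for child in TREE[node] for leaf in leaves(child)]

        def node_classifier(node):
            """Local classifier choosing among the children of `node`."""
            X, y = [], []
            for child in TREE[node]:
                for leaf in leaves(child):
                    for text in TRAIN[leaf]:
                        X.append(text)
                        y.append(child)
            return LogisticRegression().fit(VEC.transform(X), y)

        CLASSIFIERS = {node: node_classifier(node) for node in TREE}

        def categorize(text):
            """Divide and conquer: one local decision per level of the hierarchy."""
            node = "root"
            while node in TREE:
                node = CLASSIFIERS[node].predict(VEC.transform([text]))[0]
            return node

        print(categorize("patient fever from a bacterial infection"))
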
  9. Riloff, E.: An empirical study of automated dictionary construction for information extraction in three domains (1996) 0.05
    0.050925426 = product of:
      0.15277627 = sum of:
        0.15277627 = sum of:
          0.09779723 = weight(_text_:training in 6752) [ClassicSimilarity], result of:
            0.09779723 = score(doc=6752,freq=2.0), product of:
              0.23690371 = queryWeight, product of:
                4.67046 = idf(docFreq=1125, maxDocs=44218)
                0.050723847 = queryNorm
              0.41281426 = fieldWeight in 6752, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.67046 = idf(docFreq=1125, maxDocs=44218)
                0.0625 = fieldNorm(doc=6752)
          0.05497905 = weight(_text_:22 in 6752) [ClassicSimilarity], result of:
            0.05497905 = score(doc=6752,freq=2.0), product of:
              0.17762627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.050723847 = queryNorm
              0.30952093 = fieldWeight in 6752, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0625 = fieldNorm(doc=6752)
      0.33333334 = coord(1/3)
    
    Abstract
    AutoSlog is a system that addresses the knowledge engineering bottleneck for information extraction. AutoSlog automatically creates domain specific dictionaries for information extraction, given an appropriate training corpus. Describes experiments with AutoSlog in terrorism, joint ventures and microelectronics domains. Compares the performance of AutoSlog across the 3 domains, discusses the lessons learned and presents results from 2 experiments which demonstrate that novice users can generate effective dictionaries using AutoSlog
    Date
    6. 3.1997 16:22:15
  10. Yang, Y.; Lu, Q.; Zhao, T.: A delimiter-based general approach for Chinese term extraction (2009) 0.05
    0.049747746 = product of:
      0.07462162 = sum of:
        0.044059984 = weight(_text_:based in 3315) [ClassicSimilarity], result of:
          0.044059984 = score(doc=3315,freq=6.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.28829288 = fieldWeight in 3315, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3315)
        0.030561633 = product of:
          0.061123267 = sum of:
            0.061123267 = weight(_text_:training in 3315) [ClassicSimilarity], result of:
              0.061123267 = score(doc=3315,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2580089 = fieldWeight in 3315, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3315)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    This article addresses a two-step approach for term extraction. In the first step on term candidate extraction, a new delimiter-based approach is proposed to identify features of the delimiters of term candidates rather than those of the term candidates themselves. This delimiter-based method is much more stable and domain independent than the previous approaches. In the second step on term verification, an algorithm using link analysis is applied to calculate the relevance between term candidates and the sentences from which the terms are extracted. All information is obtained from the working domain corpus without the need for prior domain knowledge. The approach is not targeted at any specific domain and there is no need for extensive training when applying it to new domains. In other words, the method is not domain dependent and it is especially useful for resource-limited domains. Evaluations of Chinese text in two different domains show quite significant improvements over existing techniques and also verify its efficiency and its relatively domain-independent nature. The proposed method is also very effective for extracting new terms so that it can serve as an efficient tool for updating domain knowledge, especially for expanding lexicons.
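
    A minimal illustration of the delimiter idea: model the high-frequency words that surround terms rather than the terms themselves, and take the spans between consecutive delimiters as term candidates. The frequency cutoff here is a crude stand-in for the paper's learned delimiter features, and the verification step via link analysis is omitted.

        import re
        from collections import Counter

        def delimiter_candidates(corpus, top_k=2, max_len=4):
            """Most frequent tokens act as delimiters; spans between them
            become multi-word term candidates."""
            docs = [re.findall(r"\w+", doc.lower()) for doc in corpus]
            freq = Counter(t for tokens in docs for t in tokens)
            delimiters = {t for t, _ in freq.most_common(top_k)}
            candidates = Counter()
            for tokens in docs:
                span = []
                for tok in tokens:
                    if tok in delimiters:
                        if 1 <= len(span) <= max_len:
                            candidates[" ".join(span)] += 1
                        span = []
                    else:
                        span.append(tok)
                if 1 <= len(span) <= max_len:      # flush the trailing span
                    candidates[" ".join(span)] += 1
            return candidates

        corpus = ["the accuracy of the extraction of terms depends on the quality of the corpus",
                  "the coverage of the corpus limits the extraction of terms"]
        print(delimiter_candidates(corpus).most_common(5))
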
  11. Schneider, J.W.; Borlund, P.: A bibliometric-based semiautomatic approach to identification of candidate thesaurus terms : parsing and filtering of noun phrases from citation contexts (2005) 0.05
    0.04961206 = product of:
      0.07441809 = sum of:
        0.050364755 = weight(_text_:based in 156) [ClassicSimilarity], result of:
          0.050364755 = score(doc=156,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.3295462 = fieldWeight in 156, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=156)
        0.024053333 = product of:
          0.048106667 = sum of:
            0.048106667 = weight(_text_:22 in 156) [ClassicSimilarity], result of:
              0.048106667 = score(doc=156,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2708308 = fieldWeight in 156, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=156)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The present study investigates the ability of a bibliometric-based semi-automatic method to select candidate thesaurus terms from citation contexts. The method consists of document co-citation analysis, citation context analysis, and noun phrase parsing. The investigation is carried out within the specialty area of periodontology. The results clearly demonstrate that the method is able to select important candidate thesaurus terms within the chosen specialty area.
    Date
    8. 3.2007 19:55:22
  12. Huo, W.: Automatic multi-word term extraction and its application to Web-page summarization (2012) 0.05
    0.048321307 = product of:
      0.14496392 = sum of:
        0.14496392 = sum of:
          0.10372963 = weight(_text_:training in 563) [ClassicSimilarity], result of:
            0.10372963 = score(doc=563,freq=4.0), product of:
              0.23690371 = queryWeight, product of:
                4.67046 = idf(docFreq=1125, maxDocs=44218)
                0.050723847 = queryNorm
              0.43785566 = fieldWeight in 563, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.67046 = idf(docFreq=1125, maxDocs=44218)
                0.046875 = fieldNorm(doc=563)
          0.041234285 = weight(_text_:22 in 563) [ClassicSimilarity], result of:
            0.041234285 = score(doc=563,freq=2.0), product of:
              0.17762627 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.050723847 = queryNorm
              0.23214069 = fieldWeight in 563, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=563)
      0.33333334 = coord(1/3)
    
    Abstract
    In this thesis we propose three new word association measures for multi-word term extraction. We combine these association measures with LocalMaxs algorithm in our extraction model and compare the results of different multi-word term extraction methods. Our approach is language and domain independent and requires no training data. It can be applied to such tasks as text summarization, information retrieval, and document classification. We further explore the potential of using multi-word terms as an effective representation for general web-page summarization. We extract multi-word terms from human written summaries in a large collection of web-pages, and generate the summaries by aligning document words with these multi-word terms. Our system applies machine translation technology to learn the aligning process from a training set and focuses on selecting high quality multi-word terms from human written summaries to generate suitable results for web-page summarization.
    Date
    10. 1.2013 19:22:47
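
    To make the setup concrete, here is one classical word association measure, pointwise mutual information over adjacent word pairs, combined with a simplified "keep only local maxima" filter in the spirit of LocalMaxs. PMI stands in for the thesis's three proposed measures, the overlap test is a simplification, and the corpus is invented.

        import math
        import re
        from collections import Counter

        def pmi_bigrams(corpus, min_count=2):
            """Pointwise mutual information for adjacent word pairs."""
            tokens = [t for doc in corpus for t in re.findall(r"\w+", doc.lower())]
            unigrams = Counter(tokens)
            bigrams = Counter(zip(tokens, tokens[1:]))
            n = len(tokens)
            return {bg: math.log(c * n / (unigrams[bg[0]] * unigrams[bg[1]]))
                    for bg, c in bigrams.items() if c >= min_count}

        def local_maxima(scores):
            """Simplified LocalMaxs: keep a pair if no overlapping pair scores higher."""
            kept = []
            for (a, b), s in scores.items():
                overlapping = [v for (x, y), v in scores.items()
                               if (x, y) != (a, b) and (y == a or x == b)]
                if all(s >= v for v in overlapping):
                    kept.append(((a, b), s))
            return sorted(kept, key=lambda kv: -kv[1])

        corpus = ["statistical machine translation needs machine translation memories",
                  "the translation memories store machine translation output"]
        print(local_maxima(pmi_bigrams(corpus)))
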
  13. Basili, R.; Pazienza, M.T.; Velardi, P.: An empirical symbolic approach to natural language processing (1996) 0.05
    0.045460265 = product of:
      0.068190396 = sum of:
        0.040700868 = weight(_text_:based in 6753) [ClassicSimilarity], result of:
          0.040700868 = score(doc=6753,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.26631355 = fieldWeight in 6753, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0625 = fieldNorm(doc=6753)
        0.027489524 = product of:
          0.05497905 = sum of:
            0.05497905 = weight(_text_:22 in 6753) [ClassicSimilarity], result of:
              0.05497905 = score(doc=6753,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.30952093 = fieldWeight in 6753, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=6753)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Describes and evaluates the results of a large-scale lexical learning system, ARISTO-LEX, that uses a combination of probabilistic and knowledge-based methods for the acquisition of selectional restrictions of words in sublanguages. Presents experimental data obtained from different corpora in different domains and languages, and shows that the acquired lexical data not only have practical applications in natural language processing, but are also useful for a comparative analysis of sublanguages.
    Date
    6. 3.1997 16:22:15
  14. Haas, S.W.: Natural language processing : toward large-scale, robust systems (1996) 0.05
    0.045460265 = product of:
      0.068190396 = sum of:
        0.040700868 = weight(_text_:based in 7415) [ClassicSimilarity], result of:
          0.040700868 = score(doc=7415,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.26631355 = fieldWeight in 7415, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0625 = fieldNorm(doc=7415)
        0.027489524 = product of:
          0.05497905 = sum of:
            0.05497905 = weight(_text_:22 in 7415) [ClassicSimilarity], result of:
              0.05497905 = score(doc=7415,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.30952093 = fieldWeight in 7415, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=7415)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    State of the art review of natural language processing updating an earlier review published in ARIST 22(1987). Discusses important developments that have allowed for significant advances in the field of natural language processing: materials and resources; knowledge-based systems and statistical approaches; and a strong emphasis on evaluation. Reviews some natural language processing applications and common problems still awaiting solution. Considers closely related applications such as language generation and the generation phase of machine translation, which face the same problems as natural language processing. Covers natural language methodologies for information retrieval only briefly.
  15. Schröter, F.; Meyer, U.: Entwicklung sprachlicher Handlungskompetenz in Englisch mit Hilfe eines Multimedia-Sprachlernsystems (2000) 0.04
    0.044799745 = product of:
      0.06719962 = sum of:
        0.03052565 = weight(_text_:based in 5567) [ClassicSimilarity], result of:
          0.03052565 = score(doc=5567,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 5567, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=5567)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 5567) [ClassicSimilarity], result of:
              0.073347926 = score(doc=5567,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 5567, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5567)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Theme
    Computer Based Training
  16. Dunning, T.: Statistical identification of language (1994) 0.04
    0.044799745 = product of:
      0.06719962 = sum of:
        0.03052565 = weight(_text_:based in 3627) [ClassicSimilarity], result of:
          0.03052565 = score(doc=3627,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 3627, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=3627)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 3627) [ClassicSimilarity], result of:
              0.073347926 = score(doc=3627,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 3627, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3627)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    A statistically based program has been written which learns to distinguish between languages. The amount of training text that such a program needs is surprisingly small, and the amount of text needed to make an identification is also quite small. The program incorporates no linguistic presuppositions other than the assumption that text can be encoded as a string of bytes. Such a program can be used to determine which language small bits of text are in. It also shows a potential for what might be called 'statistical philology' in that it may be applied directly to phonetic transcriptions to help elucidate family trees among language dialects. A variant of this program has been shown to be useful as a quality control in biochemistry. In this application, genetic sequences are assumed to be expressions in a language peculiar to the organism from which the sequence is taken. Thus language identification becomes species identification.
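
    A minimal sketch of the byte-level approach described above: train a smoothed byte-bigram model per language and pick the language with the highest log-likelihood. The two training snippets are far smaller than anything realistic, though Dunning's point is that surprisingly little text suffices.

        import math
        from collections import Counter

        def train(text):
            """Byte-bigram and byte-unigram counts from raw UTF-8 bytes."""
            data = text.encode("utf-8")
            return Counter(zip(data, data[1:])), Counter(data)

        def log_likelihood(text, model):
            bigrams, unigrams = model
            data = text.encode("utf-8")
            ll = 0.0
            for a, b in zip(data, data[1:]):
                # add-one smoothed P(b | a) over the 256-byte alphabet
                ll += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + 256))
            return ll

        MODELS = {   # tiny invented training texts; real training uses more
            "en": train("the quick brown fox jumps over the lazy dog and the cat"),
            "de": train("der schnelle braune fuchs springt über den faulen hund"),
        }

        def identify(text):
            return max(MODELS, key=lambda lang: log_likelihood(text, MODELS[lang]))

        print(identify("the dog and the fox"))   # -> en
        print(identify("der hund springt"))      # -> de
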
  17. Duwairi, R.M.: Machine learning for Arabic text categorization (2006) 0.04
    0.044799745 = product of:
      0.06719962 = sum of:
        0.03052565 = weight(_text_:based in 5115) [ClassicSimilarity], result of:
          0.03052565 = score(doc=5115,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.19973516 = fieldWeight in 5115, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=5115)
        0.036673963 = product of:
          0.073347926 = sum of:
            0.073347926 = weight(_text_:training in 5115) [ClassicSimilarity], result of:
              0.073347926 = score(doc=5115,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.3096107 = fieldWeight in 5115, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5115)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
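
    A sketch of the distance-based scheme under stated assumptions: each category is the summed word vector of its training documents, and an unclassified document goes to the category whose vector is closest by cosine similarity. The toy data is English and invented; Duwairi works on stemmed Arabic text.

        import math
        import re
        from collections import Counter

        def vectorize(text):
            return Counter(re.findall(r"\w+", text.lower()))

        def cosine(u, v):
            dot = sum(c * v[w] for w, c in u.items())
            norm = (math.sqrt(sum(c * c for c in u.values()))
                    * math.sqrt(sum(c * c for c in v.values())))
            return dot / norm if norm else 0.0

        # Learning phase: one feature vector per category (sum over its documents).
        TRAIN = {"sports": ["the team won the match", "the players scored two goals"],
                 "economy": ["the bank raised interest rates", "markets fell sharply"]}
        CENTROIDS = {cat: sum((vectorize(d) for d in docs), Counter())
                     for cat, docs in TRAIN.items()}

        def classify(text):
            """Testing phase: assign the document to the closest category vector."""
            v = vectorize(text)
            return max(CENTROIDS, key=lambda cat: cosine(v, CENTROIDS[cat]))

        print(classify("the players won the match"))   # -> sports
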
  18. Levin, M.; Krawczyk, S.; Bethard, S.; Jurafsky, D.: Citation-based bootstrapping for large-scale author disambiguation (2012) 0.04
    0.04435764 = product of:
      0.06653646 = sum of:
        0.035974823 = weight(_text_:based in 246) [ClassicSimilarity], result of:
          0.035974823 = score(doc=246,freq=4.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23539014 = fieldWeight in 246, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=246)
        0.030561633 = product of:
          0.061123267 = sum of:
            0.061123267 = weight(_text_:training in 246) [ClassicSimilarity], result of:
              0.061123267 = score(doc=246,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2580089 = fieldWeight in 246, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=246)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first "bootstrap" stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B3 F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1.
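
    A schematic of the two-stage idea, with invented records, rules, and features: a high-precision rule (identical email) bootstraps positive and negative pairs, a supervised classifier is trained on those, and it then decides the pairs the rule could not label. The real system uses far richer features, notably self-citation.

        from sklearn.linear_model import LogisticRegression

        # Invented author records: (name, email, coauthor surnames).
        RECORDS = [("J. Smith", "jsmith@a.edu", {"Lee"}),
                   ("John Smith", "jsmith@a.edu", {"Lee", "Kim"}),
                   ("J. Smith", None, {"Kim"}),
                   ("J. Smith", None, {"Chen"}),
                   ("Jane Smith", "jane@b.org", {"Wong"})]

        def features(r1, r2):
            """Cheap pairwise features for the supervised stage."""
            return [float(r1[0].split()[-1] == r2[0].split()[-1]),   # same surname
                    float(bool(r1[2] & r2[2]))]                      # shared coauthor

        def bootstrap_label(r1, r2):
            """Stage 1, high-precision rule: same email -> same author,
            different email -> different author; no email -> unknown (None)."""
            if r1[1] and r2[1]:
                return 1 if r1[1] == r2[1] else 0
            return None

        pairs = [(a, b) for i, a in enumerate(RECORDS) for b in RECORDS[i + 1:]]
        labeled = [(features(a, b), lab) for a, b in pairs
                   if (lab := bootstrap_label(a, b)) is not None]
        X, y = zip(*labeled)

        clf = LogisticRegression().fit(list(X), list(y))    # stage 2: supervised
        for a, b in pairs:
            if bootstrap_label(a, b) is None:               # pairs the rule skipped
                same = clf.predict([features(a, b)])[0]
                print(a[0], "~", b[0], "->", "same" if same else "different")
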
  19. Humphrey, S.M.; Rogers, W.J.; Kilicoglu, H.; Demner-Fushman, D.; Rindflesch, T.C.: Word sense disambiguation by selecting the best semantic type based on journal descriptor indexing : preliminary experiment (2006) 0.04
    0.039798196 = product of:
      0.059697293 = sum of:
        0.035247985 = weight(_text_:based in 4912) [ClassicSimilarity], result of:
          0.035247985 = score(doc=4912,freq=6.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.2306343 = fieldWeight in 4912, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.03125 = fieldNorm(doc=4912)
        0.024449307 = product of:
          0.048898615 = sum of:
            0.048898615 = weight(_text_:training in 4912) [ClassicSimilarity], result of:
              0.048898615 = score(doc=4912,freq=2.0), product of:
                0.23690371 = queryWeight, product of:
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.050723847 = queryNorm
                0.20640713 = fieldWeight in 4912, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.67046 = idf(docFreq=1125, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4912)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    An experiment was performed at the National Library of Medicine® (NLM®) in word sense disambiguation (WSD) using the Journal Descriptor Indexing (JDI) methodology. The motivation is the need to solve the ambiguity problem confronting NLM's MetaMap system, which maps free text to terms corresponding to concepts in NLM's Unified Medical Language System® (UMLS®) Metathesaurus®. If the text maps to more than one Metathesaurus concept at the same high confidence score, MetaMap has no way of knowing which concept is the correct mapping. We describe the JDI methodology, which is ultimately based on statistical associations between words in a training set of MEDLINE® citations and a small set of journal descriptors (assigned by humans to journals per se) assumed to be inherited by the citations. JDI is the basis for selecting the best meaning that is correlated with UMLS semantic types (STs) assigned to ambiguous concepts in the Metathesaurus. For example, the ambiguity transport has two meanings: "Biological Transport" assigned the ST Cell Function and "Patient transport" assigned the ST Health Care Activity. A JDI-based methodology can analyze text containing transport and determine which ST receives a higher score for that text, which then returns the associated meaning, presumed to apply to the ambiguity itself. We then present an experiment in which a baseline disambiguation method was compared to four versions of JDI in disambiguating 45 ambiguous strings from NLM's WSD Test Collection. Overall average precision for the highest-scoring JDI version was 0.7873 compared to 0.2492 for the baseline method, and average precision for individual ambiguities was greater than 0.90 for 23 of them (51%), greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (79%). On the basis of these results, we hope to improve performance of JDI and test its use in applications.
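
    A schematic of the selection step described above. The word-to-semantic-type association scores below are invented stand-ins for the statistically trained JDI associations; only the control flow (score each candidate ST against the context, return the sense of the winner) follows the abstract.

        # The ambiguous string "transport" maps to two concepts, each with an ST.
        SENSES = {"Biological Transport": "Cell Function",
                  "Patient transport": "Health Care Activity"}

        # Invented word -> ST association scores (stand-ins for trained JDI values).
        ASSOC = {"membrane":  {"Cell Function": 0.9, "Health Care Activity": 0.1},
                 "ion":       {"Cell Function": 0.8, "Health Care Activity": 0.1},
                 "ambulance": {"Cell Function": 0.1, "Health Care Activity": 0.9},
                 "hospital":  {"Cell Function": 0.2, "Health Care Activity": 0.7}}

        def disambiguate(context_words):
            """Return the sense whose semantic type best fits the context."""
            def st_score(st):
                hits = [ASSOC[w][st] for w in context_words if w in ASSOC]
                return sum(hits) / len(hits) if hits else 0.0
            return max(SENSES, key=lambda sense: st_score(SENSES[sense]))

        print(disambiguate(["ion", "membrane"]))        # -> Biological Transport
        print(disambiguate(["hospital", "ambulance"]))  # -> Patient transport
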
  20. Schwarz, C.: THESYS: Thesaurus Syntax System : a fully automatic thesaurus building aid (1988) 0.04
    0.039777733 = product of:
      0.059666596 = sum of:
        0.03561326 = weight(_text_:based in 1361) [ClassicSimilarity], result of:
          0.03561326 = score(doc=1361,freq=2.0), product of:
            0.15283063 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.050723847 = queryNorm
            0.23302436 = fieldWeight in 1361, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1361)
        0.024053333 = product of:
          0.048106667 = sum of:
            0.048106667 = weight(_text_:22 in 1361) [ClassicSimilarity], result of:
              0.048106667 = score(doc=1361,freq=2.0), product of:
                0.17762627 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.050723847 = queryNorm
                0.2708308 = fieldWeight in 1361, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1361)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    THESYS is based on the natural language processing of free-text databases. It yields statistically evaluated correlations between words of the database. These correlations correspond to traditional thesaurus relations. The person who has to build a thesaurus is thus assisted by the proposals made by THESYS. THESYS is being tested on commercial databases under real world conditions. It is part of a text processing project at Siemens, called TINA (Text-Inhalts-Analyse). Software from TINA is actually being applied and evaluated by the US Department of Commerce for patent search and indexing (REALIST: REtrieval Aids by Linguistics and STatistics)
    Date
    6. 1.1999 10:22:07
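
    A toy version of the statistical core: score word pairs by how strongly they co-occur across documents (a Dice coefficient here, as one plausible choice; THESYS works on syntactically analyzed free text rather than raw co-occurrence) and surface the strongest pairs as candidate thesaurus relations for the thesaurus builder to review.

        import re
        from collections import Counter
        from itertools import combinations

        def candidate_relations(corpus, min_df=2, min_score=0.5):
            """Dice coefficient over document-level co-occurrence of word pairs."""
            doc_sets = [set(re.findall(r"\w+", doc.lower())) for doc in corpus]
            df = Counter(w for s in doc_sets for w in s)
            vocab = {w for w, c in df.items() if c >= min_df}   # drop rare words
            pair_df = Counter(p for s in doc_sets
                              for p in combinations(sorted(s & vocab), 2))
            scored = ((p, 2 * c / (df[p[0]] + df[p[1]])) for p, c in pair_df.items())
            # THESYS would filter these candidates further via syntactic analysis;
            # here the pairs go straight to the human thesaurus builder.
            return sorted((ps for ps in scored if ps[1] >= min_score),
                          key=lambda ps: -ps[1])

        corpus = ["retrieval aids by linguistics and statistics",
                  "patent search and patent indexing",
                  "indexing and retrieval of patent documents"]
        for pair, score in candidate_relations(corpus):
            print(pair, round(score, 2))
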

Languages

  • e 227
  • d 23
  • ru 2
  • chi 1
  • f 1
  • m 1

Types

  • a 219
  • el 22
  • m 13
  • s 9
  • p 5
  • x 4
  • d 1
  • pat 1
  • r 1
