Search (146 results, page 1 of 8)

  • theme_ss:"Computerlinguistik"
  1. Kiela, D.; Clark, S.: Detecting compositionality of multi-word expressions using nearest neighbours in vector space models (2013) 0.20
    0.20439655 = product of:
      0.4087931 = sum of:
        0.24674822 = weight(_text_:vector in 1161) [ClassicSimilarity], result of:
          0.24674822 = score(doc=1161,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.804924 = fieldWeight in 1161, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0625 = fieldNorm(doc=1161)
        0.16204487 = weight(_text_:space in 1161) [ClassicSimilarity], result of:
          0.16204487 = score(doc=1161,freq=4.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.6522972 = fieldWeight in 1161, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0625 = fieldNorm(doc=1161)
      0.5 = coord(2/4)
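    The breakdown above is Lucene's ClassicSimilarity explanation: per matching term, tf is the square root of the in-document frequency, queryWeight is idf times queryNorm, fieldWeight is tf times idf times fieldNorm, and the coord factor scales the sum by the fraction of query terms that matched. A minimal sketch that recombines the displayed factors into the displayed score (function and variable names are illustrative):

```python
from math import sqrt

def classic_similarity(terms, query_norm, matched, total_query_terms):
    """Recombine the factors shown in a Lucene ClassicSimilarity explanation tree.

    terms: one (freq, idf, field_norm) tuple per matching query term.
    """
    score = 0.0
    for freq, idf, field_norm in terms:
        tf = sqrt(freq)                        # 2.0 = tf(freq=4.0)
        query_weight = idf * query_norm        # e.g. 6.439392 * 0.047605187
        field_weight = tf * idf * field_norm   # e.g. 2.0 * 6.439392 * 0.0625
        score += query_weight * field_weight   # weight(_text_:vector ...)
    return score * matched / total_query_terms # coord(2/4) = 0.5

# Factors copied from hit 1 (doc 1161), terms "vector" and "space":
print(classic_similarity([(4.0, 6.439392, 0.0625), (4.0, 5.2183776, 0.0625)],
                         query_norm=0.047605187, matched=2, total_query_terms=4))
# -> roughly 0.20439655, the score shown for this hit
```

    The same recombination applies to every explanation tree in this result list.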
    
    Abstract
    We present a novel unsupervised approach to detecting the compositionality of multi-word expressions. We compute the compositionality of a phrase through substituting the constituent words with their "neighbours" in a semantic vector space and averaging over the distance between the original phrase and the substituted neighbour phrases. Several methods of obtaining neighbours are presented. The results are compared to existing supervised results and achieve state-of-the-art performance on a verb-object dataset of human compositionality ratings.
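    A minimal sketch of the neighbour-substitution idea described in this abstract, assuming pre-computed word vectors, additive phrase composition, and cosine similarity; the paper evaluates several neighbour-selection methods, and the vector store, the neighbour count k, and the helper names here are illustrative assumptions:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbours(word, vectors, k=3):
    """The k words most similar to `word` in the vector space (excluding itself)."""
    sims = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(sims, key=lambda s: -s[1])[:k]]

def compositionality(phrase, vectors, k=3):
    """Average similarity between the original phrase vector and the phrase vectors
    obtained by swapping each constituent word for one of its neighbours.
    Higher values suggest a compositional phrase, lower values an idiomatic one."""
    words = phrase.split()
    original = sum(vectors[w] for w in words)          # additive composition (an assumption)
    scores = []
    for i, w in enumerate(words):
        for neighbour in nearest_neighbours(w, vectors, k):
            substituted = words[:i] + [neighbour] + words[i + 1:]
            scores.append(cosine(original, sum(vectors[x] for x in substituted)))
    return float(np.mean(scores))
```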
  2. Savoy, J.: Searching strategies for the Hungarian language (2008) 0.18
    0.18009433 = product of:
      0.24012578 = sum of:
        0.130858 = weight(_text_:vector in 2037) [ClassicSimilarity], result of:
          0.130858 = score(doc=2037,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 2037, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=2037)
        0.08593727 = weight(_text_:space in 2037) [ClassicSimilarity], result of:
          0.08593727 = score(doc=2037,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 2037, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=2037)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 2037) [ClassicSimilarity], result of:
              0.046661027 = score(doc=2037,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 2037, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2037)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP. When compared to an IR scheme without stemming or one based on only a light stemmer, we find the differences to be statistically significant. When compared with probabilistic, vector-space and language models, we find that the Okapi model results in the best retrieval effectiveness. The resulting MAP is found to be about 35% better than the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure for both queries and documents significantly improves IR performance (+10%), compared to word-based indexing strategies.
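    The abstract contrasts the classical tf-idf weighting with the Okapi (BM25) model. A minimal sketch of the two standard term-weighting formulas, using the common BM25 defaults k1 = 1.2 and b = 0.75 rather than Savoy's exact experimental settings:

```python
from math import log

def tfidf_weight(tf, df, n_docs):
    """Classical tf-idf term weight."""
    return tf * log(n_docs / df)

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term weight with document-length normalization."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# Same term statistics, two weighting schemes:
print(tfidf_weight(tf=3, df=650, n_docs=44218))
print(bm25_weight(tf=3, df=650, n_docs=44218, doc_len=120, avg_doc_len=180))
```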
  3. Duwairi, R.M.: Machine learning for Arabic text categorization (2006) 0.11
    0.10839763 = product of:
      0.21679527 = sum of:
        0.130858 = weight(_text_:vector in 5115) [ClassicSimilarity], result of:
          0.130858 = score(doc=5115,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 5115, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=5115)
        0.08593727 = weight(_text_:space in 5115) [ClassicSimilarity], result of:
          0.08593727 = score(doc=5115,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 5115, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=5115)
      0.5 = coord(2/4)
    
    Abstract
    In this article we propose a distance-based classifier for categorizing Arabic text. Each category is represented as a vector of words in an m-dimensional space, and documents are classified on the basis of their closeness to feature vectors of categories. The classifier, in its learning phase, scans the set of training documents to extract features of categories that capture inherent category-specific properties; in its testing phase the classifier uses previously determined category-specific features to categorize unclassified documents. Stemming was used to reduce the dimensionality of feature vectors of documents. The accuracy of the classifier was tested by carrying out several categorization tasks on an in-house collected Arabic corpus. The results show that the proposed classifier is very accurate and robust.
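    A minimal sketch of a distance-based classifier of the kind described above: each category is represented by a vector in word space (here a centroid of raw term-frequency vectors, an assumption, since the abstract does not give the exact weighting), and a document is assigned to the closest category:

```python
from collections import Counter
import numpy as np

def vectorize(tokens, vocab):
    """Represent a document as a term-frequency vector over the vocabulary."""
    v = np.zeros(len(vocab))
    for term, count in Counter(tokens).items():
        if term in vocab:
            v[vocab[term]] = count
    return v

def train_centroids(labelled_docs):
    """labelled_docs: list of (category, token list). Returns the vocabulary and one
    centroid vector per category, built from that category's training documents."""
    vocab = {t: i for i, t in enumerate(sorted({t for _, doc in labelled_docs for t in doc}))}
    centroids = {}
    for cat in {c for c, _ in labelled_docs}:
        vectors = [vectorize(doc, vocab) for c, doc in labelled_docs if c == cat]
        centroids[cat] = np.mean(vectors, axis=0)
    return vocab, centroids

def classify(tokens, vocab, centroids):
    """Assign the document to the category whose centroid is closest."""
    v = vectorize(tokens, vocab)
    return min(centroids, key=lambda cat: np.linalg.norm(v - centroids[cat]))
```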
  4. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.06
    0.06424522 = product of:
      0.12849043 = sum of:
        0.10904834 = weight(_text_:vector in 831) [ClassicSimilarity], result of:
          0.10904834 = score(doc=831,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 831, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 831) [ClassicSimilarity], result of:
              0.03888419 = score(doc=831,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 831, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=831)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings - There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.
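    A minimal sketch of a language-modeling classifier for text without word boundaries: one character n-gram model per class with add-one smoothing, and classification by the class whose model gives the text the highest likelihood. The n-gram order, the smoothing, and the toy training data are illustrative assumptions, not the paper's configuration:

```python
from collections import Counter
from math import log

def ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_lm(texts, n=2):
    """Character n-gram counts for one class."""
    counts = Counter()
    for t in texts:
        counts.update(ngrams(t, n))
    return counts, sum(counts.values())

def log_likelihood(text, model, n=2):
    counts, total = model
    vocab_size = len(counts) + 1
    # add-one smoothing so unseen n-grams do not zero out the likelihood
    return sum(log((counts[g] + 1) / (total + vocab_size)) for g in ngrams(text, n))

def classify(text, models, n=2):
    return max(models, key=lambda c: log_likelihood(text, models[c], n))

models = {label: train_lm(texts) for label, texts in {
    "sports": ["the match ended in a draw", "a late goal in extra time"],
    "finance": ["shares fell sharply", "the bank raised interest rates"]}.items()}
print(classify("the team scored a goal", models))
```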
  5. SIGIR'92 : Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1992) 0.06
    0.06323195 = product of:
      0.1264639 = sum of:
        0.076333836 = weight(_text_:vector in 6671) [ClassicSimilarity], result of:
          0.076333836 = score(doc=6671,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.24901065 = fieldWeight in 6671, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.02734375 = fieldNorm(doc=6671)
        0.050130073 = weight(_text_:space in 6671) [ClassicSimilarity], result of:
          0.050130073 = score(doc=6671,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.20179415 = fieldWeight in 6671, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.02734375 = fieldNorm(doc=6671)
      0.5 = coord(2/4)
    
    Content
    HARMAN, D.: Relevance feedback revisited; AALBERSBERG, I.J.: Incremental relevance feedback; TAGUE-SUTCLIFFE, J.: Measuring the informativeness of a retrieval process; LEWIS, D.D.: An evaluation of phrasal and clustered representations on a text categorization task; BLOSSEVILLE, M.J., G. HÉBRAIL, M.G. MONTEIL u. N. PÉNOT: Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together; MASAND, B., G. LINOFF u. D. WALTZ: Classifying news stories using memory based reasoning; KEEN, E.M.: Term position ranking: some new test results; CROUCH, C.J. u. B. YANG: Experiments in automatic statistical thesaurus construction; GREFENSTETTE, G.: Use of syntactic context to produce term association lists for text retrieval; ANICK, P.G. u. R.A. FLYNN: Versioning a full-text information retrieval system; BURKOWSKI, F.J.: Retrieval activities in a database consisting of heterogeneous collections; DEERWESTER, S.C., K. WACLENA u. M. LaMAR: A textual object management system; NIE, J.-Y.: Towards a probabilistic modal logic for semantic-based information retrieval; WANG, A.W., S.K.M. WONG u. Y.Y. YAO: An analysis of vector space models based on computational geometry; BARTELL, B.T., G.W. COTTRELL u. R.K. BELEW: Latent semantic indexing is an optimal special case of multidimensional scaling; GLAVITSCH, U. u. P. SCHÄUBLE: A system for retrieving speech documents; MARGULIS, E.L.: N-Poisson document modelling; HESS, M.: An incrementally extensible document retrieval system based on linguistics and logical principles; COOPER, W.S., F.C. GEY u. D.P. DABNEY: Probabilistic retrieval based on staged logistic regression; FUHR, N.: Integration of probabilistic fact and text retrieval; CROFT, B., L.A. SMITH u. H. TURTLE: A loosely-coupled integration of a text retrieval system and an object-oriented database system; DUMAIS, S.T. u. J. NIELSEN: Automating the assignment of submitted manuscripts to reviewers; AGOSTI, M. u. M. MASOTTI: Design of an OPAC database to permit different subject searching accesses; ROBERTSON, A.M. u. P. WILLETT: Searching for historical word forms in a database of 17th century English text using spelling correction methods; FOX, E.A., Q.F. CHEN u. L.S. HEATH: A faster algorithm for constructing minimal perfect hash functions; MOFFAT, A. u. J. ZOBEL: Parameterised compression for sparse bitmaps; GRANDI, F., P. TIBERIO u. P. ZEZULA: Frame-sliced partitioned parallel signature files; ALLEN, B.: Cognitive differences in end user searching of a CD-ROM index; SONNENWALD, D.H.: Developing a theory to guide the process of designing information retrieval systems; CUTTING, D.R., J.O. PEDERSEN, D. KARGER u. J.W. TUKEY: Scatter/Gather: a cluster-based approach to browsing large document collections; CHALMERS, M. u. P. CHITSON: Bead: Explorations in information visualization; WILLIAMSON, C. u. B. SHNEIDERMAN: The dynamic HomeFinder: evaluating dynamic queries in a real-estate information exploring system
  6. Lu, K.; Cai, X.; Ajiferuke, I.; Wolfram, D.: Vocabulary size and its effect on topic representation (2017) 0.05
    0.054633893 = product of:
      0.109267786 = sum of:
        0.08593727 = weight(_text_:space in 3414) [ClassicSimilarity], result of:
          0.08593727 = score(doc=3414,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.34593284 = fieldWeight in 3414, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.046875 = fieldNorm(doc=3414)
        0.023330513 = product of:
          0.046661027 = sum of:
            0.046661027 = weight(_text_:model in 3414) [ClassicSimilarity], result of:
              0.046661027 = score(doc=3414,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.25490487 = fieldWeight in 3414, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3414)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
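    One of the topic-similarity measures named above, Jensen-Shannon divergence, can be computed between two topic-word distributions as in the sketch below; the toy distributions are hypothetical, whereas the study's topics come from trained topic models:

```python
import numpy as np

def kl_divergence(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """JS divergence between two topic-word distributions (0 = identical, 1 = disjoint, base 2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

topic_a = [0.5, 0.3, 0.1, 0.1]   # hypothetical word probabilities for two topics
topic_b = [0.1, 0.1, 0.4, 0.4]
print(jensen_shannon(topic_a, topic_b))
```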
  7. McMahon, J.G.; Smith, F.J.: Improved statistical language model performance with automatically generated word hierarchies (1996) 0.05
    0.049793392 = product of:
      0.19917357 = sum of:
        0.19917357 = sum of:
          0.10887573 = weight(_text_:model in 3164) [ClassicSimilarity], result of:
            0.10887573 = score(doc=3164,freq=2.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.59477806 = fieldWeight in 3164, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.109375 = fieldNorm(doc=3164)
          0.09029783 = weight(_text_:22 in 3164) [ClassicSimilarity], result of:
            0.09029783 = score(doc=3164,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.5416616 = fieldWeight in 3164, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.109375 = fieldNorm(doc=3164)
      0.25 = coord(1/4)
    
    Source
    Computational linguistics. 22(1996) no.2, S.217-248
  8. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.05
    0.047479596 = product of:
      0.09495919 = sum of:
        0.075609654 = product of:
          0.22682896 = sum of:
            0.22682896 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.22682896 = score(doc=562,freq=2.0), product of:
                0.4035973 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.047605187 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
        0.019349536 = product of:
          0.03869907 = sum of:
            0.03869907 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
              0.03869907 = score(doc=562,freq=2.0), product of:
                0.16670525 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.047605187 = queryNorm
                0.23214069 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Content
    Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
  9. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.05
    0.046265293 = product of:
      0.18506117 = sum of:
        0.18506117 = weight(_text_:vector in 5480) [ClassicSimilarity], result of:
          0.18506117 = score(doc=5480,freq=4.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.603693 = fieldWeight in 5480, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=5480)
      0.25 = coord(1/4)
    
    Abstract
    (Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods
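    A minimal sketch of the letter 5-gram representation mentioned in result (2), with a simple document-frequency cut-off standing in for the feature selection step; the paper evaluates several selection methods and classifiers, so the thresholds and the example documents here are illustrative assumptions:

```python
from collections import Counter

def letter_ngrams(text, n=5):
    text = f" {text.lower()} "               # pad so word boundaries appear in the n-grams
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep letter 5-grams occurring in at least min_df documents but not in most of them."""
    df = Counter()
    for d in docs:
        df.update(letter_ngrams(d))
    return sorted(g for g, c in df.items() if min_df <= c <= max_df_ratio * len(docs))

docs = ["Automatische Dokumentklassifikation",
        "Automatische Klassifikation von Texten",
        "Maschinelles Lernen mit Support Vector Machines"]
features = select_features(docs)
vectors = [[int(g in letter_ngrams(d)) for g in features] for d in docs]  # binary feature vectors
```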
  10. Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; Takala, P.: Good debt or bad debt : detecting semantic orientations in economic texts (2014) 0.05
    0.045528244 = product of:
      0.09105649 = sum of:
        0.07161439 = weight(_text_:space in 1226) [ClassicSimilarity], result of:
          0.07161439 = score(doc=1226,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 1226, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1226)
        0.019442094 = product of:
          0.03888419 = sum of:
            0.03888419 = weight(_text_:model in 1226) [ClassicSimilarity], result of:
              0.03888419 = score(doc=1226,freq=2.0), product of:
                0.1830527 = queryWeight, product of:
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.047605187 = queryNorm
                0.21242073 = fieldWeight in 1226, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.845226 = idf(docFreq=2569, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1226)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    The use of robo-readers to analyze news texts is an emerging technology trend in computational finance. Recent research has developed sophisticated financial polarity lexicons for investigating how financial sentiments relate to future company performance. However, based on experience from fields that commonly analyze sentiment, it is well known that the overall semantic orientation of a sentence may differ from that of individual words. This article investigates how semantic orientations can be better detected in financial and economic news by accommodating the overall phrase-structure information and domain-specific use of language. Our three main contributions are the following: (a) a human-annotated finance phrase bank that can be used for training and evaluating alternative models; (b) a technique to enhance financial lexicons with attributes that help to identify expected direction of events that affect sentiment; and (c) a linearized phrase-structure model for detecting contextual semantic orientations in economic texts. The relevance of the newly added lexicon features and the benefit of using the proposed learning algorithm are demonstrated in a comparative study against general sentiment models as well as the popular word frequency models used in recent financial studies. The proposed framework is parsimonious and avoids the explosion in feature space caused by the use of conventional n-gram features.
  11. Luo, L.; Ju, J.; Li, Y.-F.; Haffari, G.; Xiong, B.; Pan, S.: ChatRule: mining logical rules with large language models for knowledge graph reasoning (2023) 0.04
    0.043869503 = product of:
      0.087739006 = sum of:
        0.07161439 = weight(_text_:space in 1171) [ClassicSimilarity], result of:
          0.07161439 = score(doc=1171,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.28827736 = fieldWeight in 1171, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1171)
        0.016124614 = product of:
          0.032249227 = sum of:
            0.032249227 = weight(_text_:22 in 1171) [ClassicSimilarity], result of:
              0.032249227 = score(doc=1171,freq=2.0), product of:
                0.16670525 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.047605187 = queryNorm
                0.19345059 = fieldWeight in 1171, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1171)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Logical rules are essential for uncovering the logical connections between relations, which could improve the reasoning performance and provide interpretable results on knowledge graphs (KGs). Although there have been many efforts to mine meaningful logical rules over KGs, existing methods suffer from the computationally intensive searches over the rule space and a lack of scalability for large-scale KGs. Besides, they often ignore the semantics of relations which is crucial for uncovering logical connections. Recently, large language models (LLMs) have shown impressive performance in the field of natural language processing and various applications, owing to their emergent ability and generalizability. In this paper, we propose a novel framework, ChatRule, unleashing the power of large language models for mining logical rules over knowledge graphs. Specifically, the framework is initiated with an LLM-based rule generator, leveraging both the semantic and structural information of KGs to prompt LLMs to generate logical rules. To refine the generated rules, a rule ranking module estimates the rule quality by incorporating facts from existing KGs. Last, a rule validator harnesses the reasoning ability of LLMs to validate the logical correctness of ranked rules through chain-of-thought reasoning. ChatRule is evaluated on four large-scale KGs, w.r.t. different rule quality metrics and downstream tasks, showing the effectiveness and scalability of our method.
    Date
    23.11.2023 19:07:22
  12. Moohebat, M.; Raj, R.G.; Kareem, S.B.A.; Thorleuchter, D.: Identifying ISI-indexed articles by their lexical usage : a text analysis approach (2015) 0.03
    0.0327145 = product of:
      0.130858 = sum of:
        0.130858 = weight(_text_:vector in 1664) [ClassicSimilarity], result of:
          0.130858 = score(doc=1664,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 1664, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=1664)
      0.25 = coord(1/4)
    
    Abstract
    This research creates an architecture for investigating the existence of probable lexical divergences between articles, categorized as Institute for Scientific Information (ISI) and non-ISI, and consequently, if such a difference is discovered, to propose the best available classification method. Based on a collection of ISI- and non-ISI-indexed articles in the areas of business and computer science, three classification models are trained. A sensitivity analysis is applied to demonstrate the impact of words in different syntactical forms on the classification decision. The results demonstrate that the lexical domains of ISI and non-ISI articles are distinguishable by machine learning techniques. Our findings indicate that the support vector machine identifies ISI-indexed articles in both disciplines with higher precision than do the Naïve Bayesian and K-Nearest Neighbors techniques.
  13. AL-Smadi, M.; Jaradat, Z.; AL-Ayyoub, M.; Jararweh, Y.: Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features (2017) 0.03
    0.0327145 = product of:
      0.130858 = sum of:
        0.130858 = weight(_text_:vector in 5095) [ClassicSimilarity], result of:
          0.130858 = score(doc=5095,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 5095, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=5095)
      0.25 = coord(1/4)
    
    Abstract
    The rapid growth in digital information has raised considerable challenges, in particular when it comes to automated content analysis. Social media such as Twitter share a lot of their users' information about events, opinions, personalities, etc. Paraphrase Identification (PI) is concerned with recognizing whether two texts have the same/similar meaning, whereas the Semantic Text Similarity (STS) is concerned with the degree of that similarity. This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets. The approach adopts several phases of text processing, features extraction and text classification. Lexical, syntactic, and semantic features are extracted to overcome the weakness and limitations of the current technologies in solving these tasks for the Arabic language. Maximum Entropy (MaxEnt) and Support Vector Regression (SVR) classifiers are trained using these features and are evaluated using a dataset prepared for this research. The experimentation results show that the approach achieves good results in comparison to the baseline results.
  14. Corbara, S.; Moreo, A.; Sebastiani, F.: Syllabic quantity patterns as rhythmic features for Latin authorship attribution (2023) 0.03
    0.0327145 = product of:
      0.130858 = sum of:
        0.130858 = weight(_text_:vector in 846) [ClassicSimilarity], result of:
          0.130858 = score(doc=846,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.4268754 = fieldWeight in 846, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.046875 = fieldNorm(doc=846)
      0.25 = coord(1/4)
    
    Abstract
    It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on so-called syllabic quantity, that is, on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility to employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets using support vector machines (SVMs) show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
  15. Diaz, I.; Morato, J.; Lloréns, J.: ¬An algorithm for term conflation based on tree structures (2002) 0.03
    0.028645756 = product of:
      0.11458302 = sum of:
        0.11458302 = weight(_text_:space in 246) [ClassicSimilarity], result of:
          0.11458302 = score(doc=246,freq=2.0), product of:
            0.24842183 = queryWeight, product of:
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.047605187 = queryNorm
            0.46124378 = fieldWeight in 246, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.2183776 = idf(docFreq=650, maxDocs=44218)
              0.0625 = fieldNorm(doc=246)
      0.25 = coord(1/4)
    
    Abstract
    This work presents a new stemming algorithm. This algorithm stores the stemming information in tree structures. This storage allows us to enhance the performance of the algorithm due to the reduction of the search space and the overall complexity. The final result of that stemming algorithm is a normalized concept, understanding this process as the automatic extraction of the generic form (or a lexeme) for a selected term.
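    A minimal sketch of storing conflation rules in a tree so that lookup walks the term from its end and only explores matching branches, which is what shrinks the search space; the suffix-stripping rules and the class name are illustrative assumptions, since the abstract does not give the original rule set:

```python
class SuffixTrie:
    """Suffix-stripping rules stored in a tree keyed on the reversed suffix."""
    def __init__(self):
        self.children, self.replacement = {}, None

    def add_rule(self, suffix, replacement):
        node = self
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, SuffixTrie())
        node.replacement = replacement

    def conflate(self, term):
        """Walk the term from its end; the longest matching suffix rule wins."""
        node, best = self, None
        for depth, ch in enumerate(reversed(term), start=1):
            node = node.children.get(ch)
            if node is None:
                break
            if node.replacement is not None:
                best = (depth, node.replacement)
        if best is None:
            return term
        cut, replacement = best
        return term[:-cut] + replacement

rules = SuffixTrie()
rules.add_rule("ization", "ize")    # hypothetical rules for illustration
rules.add_rule("izations", "ize")
print(rules.conflate("normalization"))   # -> normalize
print(rules.conflate("organizations"))   # -> organize
```

    Because every lookup only follows the branch matching the term's final characters, adding more rules does not slow down conflation of terms they cannot apply to.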
  16. Fegley, B.D.; Torvik, V.I.: On the role of poetic versus nonpoetic features in "kindred" and diachronic poetry attribution (2012) 0.03
    0.027262084 = product of:
      0.10904834 = sum of:
        0.10904834 = weight(_text_:vector in 488) [ClassicSimilarity], result of:
          0.10904834 = score(doc=488,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 488, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=488)
      0.25 = coord(1/4)
    
    Abstract
    Author attribution studies have demonstrated remarkable success in applying orthographic and lexicographic features of text in a variety of discrimination problems. What might poetic features, such as syllabic stress and mood, contribute? We address this question in the context of two different attribution problems: (a) kindred: differentiate Langston Hughes' early poems from those of kindred poets and (b) diachronic: differentiate Hughes' early from his later poems. Using a diverse set of 535 generic text features, each categorized as poetic or nonpoetic, correlation-based greedy forward search ranked the features and a support vector machine classified the poems. A small subset of features (~10) achieved cross-validated precision and recall as high as 87%. Poetic features (rhyme patterns particularly) were nearly as effective as nonpoetic in kindred discrimination, but less effective diachronically. In other words, Hughes used both poetic and nonpoetic features in distinctive ways and his use of nonpoetic features evolved systematically while he continued to experiment with poetic features. These findings affirm qualitative studies attesting to structural elements from Black oral tradition and Black folk music (blues) and to the internal consistency of Hughes' early poetry.
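    A minimal sketch of correlation-based greedy forward search: at each step the feature is added that most improves a merit score rewarding correlation with the class and penalizing redundancy with already-selected features. The CFS-style merit function used here is a common formulation and an assumption about the study's exact criterion:

```python
import numpy as np

def abs_corr(a, b):
    c = np.corrcoef(a, b)[0, 1]
    return 0.0 if np.isnan(c) else abs(c)

def merit(selected, X, y):
    """CFS-style merit: reward class correlation, penalize inter-feature correlation."""
    k = len(selected)
    r_cf = np.mean([abs_corr(X[:, j], y) for j in selected])
    pairs = [abs_corr(X[:, i], X[:, j]) for i in selected for j in selected if i < j]
    r_ff = np.mean(pairs) if pairs else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_forward_selection(X, y, max_features=10):
    """X: samples x features matrix, y: numeric class labels. Returns selected feature indices."""
    selected, best = [], -np.inf
    while len(selected) < max_features:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if not remaining:
            break
        score, j = max((merit(selected + [j], X, y), j) for j in remaining)
        if score <= best:          # stop when no candidate improves the merit
            break
        selected.append(j)
        best = score
    return selected
```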
  17. Soni, S.; Lerman, K.; Eisenstein, J.: Follow the leader : documents on the leading edge of semantic change get more citations (2021) 0.03
    0.027262084 = product of:
      0.10904834 = sum of:
        0.10904834 = weight(_text_:vector in 169) [ClassicSimilarity], result of:
          0.10904834 = score(doc=169,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.3557295 = fieldWeight in 169, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.0390625 = fieldNorm(doc=169)
      0.25 = coord(1/4)
    
    Abstract
    Diachronic word embeddings-vector representations of words over time-offer remarkable insights into the evolution of language and provide a tool for quantifying sociocultural change from text documents. Prior work has used such embeddings to identify shifts in the meaning of individual words. However, simply knowing that a word has changed in meaning is insufficient to identify the instances of word usage that convey the historical meaning or the newer meaning. In this study, we link diachronic word embeddings to documents, by situating those documents as leaders or laggards with respect to ongoing semantic changes. Specifically, we propose a novel method to quantify the degree of semantic progressiveness in each word usage, and then show how these usages can be aggregated to obtain scores for each document. We analyze two large collections of documents, representing legal opinions and scientific articles. Documents that are scored as semantically progressive receive a larger number of citations, indicating that they are especially influential. Our work thus provides a new technique for identifying lexical semantic leaders and demonstrates a new link between progressive use of language and influence in a citation network.
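    A minimal sketch of one way to score a usage as leading or lagging a semantic change: project its context vector onto the direction from the word's earlier embedding to its later embedding, then average over the usages in a document. This is a simplified reading with hypothetical vectors; the paper's actual estimator differs in detail:

```python
import numpy as np

def usage_progressiveness(context_vec, old_vec, new_vec):
    """Positive values: the usage sits closer to the word's newer meaning."""
    direction = new_vec - old_vec
    direction = direction / np.linalg.norm(direction)
    midpoint = 0.5 * (old_vec + new_vec)
    return float(np.dot(context_vec - midpoint, direction))

def document_progressiveness(context_vecs, old_vec, new_vec):
    """Aggregate the usage-level scores into one document-level score."""
    return float(np.mean([usage_progressiveness(c, old_vec, new_vec) for c in context_vecs]))
```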
  18. Fóris, A.: Network theory and terminology (2013) 0.02
    0.024899656 = product of:
      0.099598624 = sum of:
        0.099598624 = sum of:
          0.0673494 = weight(_text_:model in 1365) [ClassicSimilarity], result of:
            0.0673494 = score(doc=1365,freq=6.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.36792353 = fieldWeight in 1365, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1365)
          0.032249227 = weight(_text_:22 in 1365) [ClassicSimilarity], result of:
            0.032249227 = score(doc=1365,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.19345059 = fieldWeight in 1365, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1365)
      0.25 = coord(1/4)
    
    Abstract
    The paper aims to present the relations of network theory and terminology. The model of scale-free networks, which has been recently developed and widely applied since, can be effectively used in terminology research as well. Operation based on the principle of networks is a universal characteristic of complex systems. Networks are governed by general laws. The model of scale-free networks can be viewed as a statistical-probability model, and it can be described with mathematical tools. Its main feature is that "everything is connected to everything else," that is, every node is reachable (in a few steps) starting from any other node; this phenomenon is called "the small world phenomenon." The existence of a linguistic network and the general laws of the operation of networks enable us to place issues of language use in the complex system of relations that reveal the deeper connections between phenomena with the help of networks embedded in each other. The realization of the metaphor that language also has a network structure is the basis of the classification methods of the terminological system, and likewise of the ways of creating terminology databases, which serve the purpose of providing easy and versatile accessibility to specialised knowledge.
    Date
    2. 9.2014 21:22:48
  19. Hammwöhner, R.: TransRouter revisited : Decision support in the routing of translation projects (2000) 0.02
    0.024896696 = product of:
      0.099586785 = sum of:
        0.099586785 = sum of:
          0.054437865 = weight(_text_:model in 5483) [ClassicSimilarity], result of:
            0.054437865 = score(doc=5483,freq=2.0), product of:
              0.1830527 = queryWeight, product of:
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.047605187 = queryNorm
              0.29738903 = fieldWeight in 5483, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.845226 = idf(docFreq=2569, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5483)
          0.045148917 = weight(_text_:22 in 5483) [ClassicSimilarity], result of:
            0.045148917 = score(doc=5483,freq=2.0), product of:
              0.16670525 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047605187 = queryNorm
              0.2708308 = fieldWeight in 5483, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5483)
      0.25 = coord(1/4)
    
    Abstract
    This paper gives an outline of the final results of the TransRouter project. In the scope of this project a decision support system for translation managers has been developed, which will support the selection of appropriate routes for translation projects. In this paper emphasis is put on the decision model, which is based on a stepwise refined assessment of translation routes. The workflow of using this system is considered as well
    Date
    10.12.2000 18:22:35
  20. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.02
    0.021809667 = product of:
      0.08723867 = sum of:
        0.08723867 = weight(_text_:vector in 5188) [ClassicSimilarity], result of:
          0.08723867 = score(doc=5188,freq=2.0), product of:
            0.30654848 = queryWeight, product of:
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.047605187 = queryNorm
            0.2845836 = fieldWeight in 5188, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.439392 = idf(docFreq=191, maxDocs=44218)
              0.03125 = fieldNorm(doc=5188)
      0.25 = coord(1/4)
    
    Abstract
    Kim and Wilbur present three techniques for the algorithmic identification in text of content-bearing terms and phrases intended for human use as entry points or hyperlinks. Using a set of 1,075 terms from MEDLINE evaluated on a zero to four, stop word to definite content word scale, they evaluate the ranked lists of their three methods based on their placement of content words in the top ranks. Data consist of the natural language elements of 304,057 MEDLINE records from 1996, and 173,252 Wall Street Journal records from the TIPSTER collection. Phrases are extracted by breaking at punctuation marks and stop words, normalized by lower casing, replacement of nonalphanumerics with spaces, and the reduction of multiple spaces. In the "strength of context" approach each document is a vector of binary values for each word or word pair. The words or word pairs are removed from all documents, and the Robertson-Sparck Jones relevance weight for each term computed, negative weights replaced with zero, those below a randomness threshold ignored, and the remainder summed for each document, to yield a score for the document and finally to assign to the term the average document score for documents in which it occurred. The average of these word scores is assigned to the original phrase. The "frequency clumping" approach defines a random phrase as one whose distribution among documents is Poisson in character. A p-value, the probability that a phrase frequency of occurrence would be equal to, or less than, Poisson expectations is computed, and a score assigned which is the negative log of that value. In the "database comparison" approach if a phrase occurring in a document allows prediction that the document is in MEDLINE rather than in the Wall Street Journal, it is considered to be content-bearing for MEDLINE. The score is computed by dividing the number of occurrences of the term in MEDLINE by occurrences in the Journal, and taking the product of all these values. The one hundred top and bottom ranked phrases that occurred in at least 500 documents were collected for each method. The union set had 476 phrases. A second selection was made of two-word phrases occurring each in only three documents with a union of 599 phrases. A judge then ranked the two sets of terms as to subject specificity on a 0 to 4 scale. Precision was the average subject specificity of the first r ranks and recall the fraction of the subject-specific phrases in the first r ranks, and eleven-point average precision was used as a summary measure. The three methods all move content-bearing terms forward in the lists, as does the use of the sum of the logs of the three methods.
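    A minimal sketch of the "frequency clumping" score summarized above: compare a phrase's observed document frequency with what random (Poisson) scattering of its occurrences would predict, and take the negative log of the probability of seeing a document count that low or lower. The exact parameterization of the baseline is an assumption based on this summary:

```python
from math import exp, log
from scipy.stats import binom

def clumping_score(total_occurrences, doc_frequency, n_docs):
    """Negative log probability of the phrase appearing in this few (or fewer) documents,
    given its total number of occurrences and a Poisson (random scatter) baseline."""
    p_doc = 1.0 - exp(-total_occurrences / n_docs)   # chance a random document contains it
    p_value = binom.cdf(doc_frequency, n_docs, p_doc)
    return -log(max(p_value, 1e-300))

# A phrase occurring 500 times but concentrated in 120 documents clumps strongly,
# while one spread over 480 documents looks close to random:
print(clumping_score(500, 120, n_docs=304057))
print(clumping_score(500, 480, n_docs=304057))
```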

Years

Languages

  • e 122
  • d 19
  • f 2
  • chi 1
  • ru 1

Types

  • a 121
  • el 22
  • m 9
  • s 6
  • x 5
  • p 4
  • d 1