Search (123 results, page 3 of 7)

  • × theme_ss:"Computerlinguistik"
  • × year_i:[2010 TO 2020}
  1. Zadeh, B.Q.; Handschuh, S.: ¬The ACL RD-TEC : a dataset for benchmarking terminology extraction and classification in computational linguistics (2014) 0.00
    0.0020392092 = product of:
      0.0040784185 = sum of:
        0.0040784185 = product of:
          0.008156837 = sum of:
            0.008156837 = weight(_text_:a in 2803) [ClassicSimilarity], result of:
              0.008156837 = score(doc=2803,freq=10.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1709182 = fieldWeight in 2803, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2803)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This paper introduces ACL RD-TEC: a dataset for evaluating the extraction and classification of terms from literature in the domain of computational linguistics. The dataset is derived from the Association for Computational Linguistics anthology reference corpus (ACL ARC). In its first release, the ACL RD-TEC consists of automatically segmented, part-of-speech-tagged ACL ARC documents, three lists of candidate terms, and more than 82,000 manually annotated terms. The annotated terms are marked as either valid or invalid, and valid terms are further classified as technology and non-technology terms. Technology terms signify methods, algorithms, and solutions in computational linguistics. The paper describes the dataset and reports the relevant statistics. We hope the step described in this paper encourages a collaborative effort towards building a full-fledged annotated corpus from the computational linguistics literature.
    Type
    a
  2. Engerer, V.: Exploring interdisciplinary relationships between linguistics and information retrieval from the 1960s to today (2017) 0.00
    0.0020392092 = product of:
      0.0040784185 = sum of:
        0.0040784185 = product of:
          0.008156837 = sum of:
            0.008156837 = weight(_text_:a in 3434) [ClassicSimilarity], result of:
              0.008156837 = score(doc=3434,freq=10.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1709182 = fieldWeight in 3434, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3434)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This article explores how linguistics has influenced information retrieval (IR) and attempts to explain the impact of linguistics through an analysis of internal developments in information science generally, and IR in particular. It notes that information science/IR has been evolving from a case science into a fully fledged, "disciplined"/disciplinary science. The article establishes correspondences between linguistics and information science/IR using the three established IR paradigms-physical, cognitive, and computational-as a frame of reference. The current relationship between information science/IR and linguistics is elucidated through discussion of some recent information science publications dealing with linguistic topics and a novel technique, "keyword collocation analysis," is introduced. Insights from interdisciplinarity research and case theory are also discussed. It is demonstrated that the three stages of interdisciplinarity, namely multidisciplinarity, interdisciplinarity (in the narrow sense), and transdisciplinarity, can be linked to different phases of the information science/IR-linguistics relationship and connected to different ways of using linguistic theory in information science and IR.
    Type
    a
  3. Korman, D.Z.; Mack, E.; Jett, J.; Renear, A.H.: Defining textual entailment (2018) 0.00
    0.0020392092 = product of:
      0.0040784185 = sum of:
        0.0040784185 = product of:
          0.008156837 = sum of:
            0.008156837 = weight(_text_:a in 4284) [ClassicSimilarity], result of:
              0.008156837 = score(doc=4284,freq=10.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1709182 = fieldWeight in 4284, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4284)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Textual entailment is a relationship that obtains between fragments of text when one fragment in some sense implies the other fragment. The automation of textual entailment recognition supports a wide variety of text-based tasks, including information retrieval, information extraction, question answering, text summarization, and machine translation. Much ingenuity has been devoted to developing algorithms for identifying textual entailments, but relatively little to saying what textual entailment actually is. This article is a review of the logical and philosophical issues involved in providing an adequate definition of textual entailment. We show that many natural definitions of textual entailment are refuted by counterexamples, including the most widely cited definition of Dagan et al. We then articulate and defend the following revised definition: T textually entails H?=?df typically, a human reading T would be justified in inferring the proposition expressed by H from the proposition expressed by T. We also show that textual entailment is context-sensitive, nontransitive, and nonmonotonic.
    Type
    a
  4. Lu, C.; Bu, Y.; Wang, J.; Ding, Y.; Torvik, V.; Schnaars, M.; Zhang, C.: Examining scientific writing styles from the perspective of linguistic complexity : a cross-level moderation model (2019) 0.00
    0.0020392092 = product of:
      0.0040784185 = sum of:
        0.0040784185 = product of:
          0.008156837 = sum of:
            0.008156837 = weight(_text_:a in 5219) [ClassicSimilarity], result of:
              0.008156837 = score(doc=5219,freq=10.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1709182 = fieldWeight in 5219, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5219)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. To uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained by Ethnea, and examined the scientific writing styles in English from a two-fold perspective of linguistic complexity: (a) syntactic complexity, including measurements of sentence length and sentence complexity; and (b) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between groups in syntactical and lexical complexity.
    Type
    a
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I.: Attention Is all you need (2017) 0.00
    0.0020392092 = product of:
      0.0040784185 = sum of:
        0.0040784185 = product of:
          0.008156837 = sum of:
            0.008156837 = weight(_text_:a in 970) [ClassicSimilarity], result of:
              0.008156837 = score(doc=970,freq=10.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1709182 = fieldWeight in 970, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=970)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
    Type
    a
  6. Carrillo-de-Albornoz, J.; Plaza, L.: ¬An emotion-based model of negation, intensifiers, and modality for polarity and intensity classification (2013) 0.00
    0.0020106873 = product of:
      0.0040213745 = sum of:
        0.0040213745 = product of:
          0.008042749 = sum of:
            0.008042749 = weight(_text_:a in 1005) [ClassicSimilarity], result of:
              0.008042749 = score(doc=1005,freq=14.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1685276 = fieldWeight in 1005, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1005)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Negation, intensifiers, and modality are common linguistic constructions that may modify the emotional meaning of the text and therefore need to be taken into consideration in sentiment analysis. Negation is usually considered as a polarity shifter, whereas intensifiers are regarded as amplifiers or diminishers of the strength of such polarity. Modality, in turn, has only been addressed in a very naïve fashion, so that modal forms are treated as polarity blockers. However, processing these constructions as mere polarity modifiers may be adequate for polarity classification, but it is not enough for more complex tasks (e.g., intensity classification), for which a more fine-grained model based on emotions is needed. In this work, we study the effect of modifiers on the emotions affected by them and propose a model of negation, intensifiers, and modality especially conceived for sentiment analysis tasks. We compare our emotion-based strategy with two traditional approaches based on polar expressions and find that representing the text as a set of emotions increases accuracy in different classification tasks and that this representation allows for a more accurate modeling of modifiers that results in further classification improvements. We also study the most common uses of modifiers in opinionated texts and quantify their impact in polarity and intensity classification. Finally, we analyze the joint effect of emotional modifiers and find that interesting synergies exist between them.
    Type
    a
  7. Keselman, A.; Rosemblat, G.; Kilicoglu, H.; Fiszman, M.; Jin, H.; Shin, D.; Rindflesch, T.C.: Adapting semantic natural language processing technology to address information overload in influenza epidemic management (2010) 0.00
    0.0020106873 = product of:
      0.0040213745 = sum of:
        0.0040213745 = product of:
          0.008042749 = sum of:
            0.008042749 = weight(_text_:a in 1312) [ClassicSimilarity], result of:
              0.008042749 = score(doc=1312,freq=14.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1685276 = fieldWeight in 1312, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1312)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The explosion of disaster health information results in information overload among response professionals. The objective of this project was to determine the feasibility of applying semantic natural language processing (NLP) technology to addressing this overload. The project characterizes concepts and relationships commonly used in disaster health-related documents on influenza pandemics, as the basis for adapting an existing semantic summarizer to the domain. Methods include human review and semantic NLP analysis of a set of relevant documents. This is followed by a pilot test in which two information specialists use the adapted application for a realistic information-seeking task. According to the results, the ontology of influenza epidemics management can be described via a manageable number of semantic relationships that involve concepts from a limited number of semantic types. Test users demonstrate several ways to engage with the application to obtain useful information. This suggests that existing semantic NLP algorithms can be adapted to support information summarization and visualization in influenza epidemics and other disaster health areas. However, additional research is needed in the areas of terminology development (as many relevant relationships and terms are not part of existing standardized vocabularies), NLP, and user interface design.
    Type
    a
  8. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.00
    0.0020106873 = product of:
      0.0040213745 = sum of:
        0.0040213745 = product of:
          0.008042749 = sum of:
            0.008042749 = weight(_text_:a in 3042) [ClassicSimilarity], result of:
              0.008042749 = score(doc=3042,freq=14.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1685276 = fieldWeight in 3042, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3042)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf?idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) has been proposed to define the topics included in a corpus. As another strategy, this study proposes to apply a vocabulary specificity measure (Z?score) to determine the most significantly overused word-types or short sequences of them. Our experiments show that the simple term frequency measure is not able to discriminate between specific terms associated with a document or a set of texts. Using the tf idf or LDA approach, the selection requires some arbitrary decisions. Based on the term-specific measure (Z?score), the term selection has a clear theoretical basis. Moreover, the most significant sentences for each presidency can be determined. As another facet, we can visualize the dynamic evolution of usage of some terms associated with their specificity measures. Finally, this technique can be employed to define the most important lexical leaders introducing terms overused by the k following presidencies.
    Type
    a
  9. Collovini de Abreu, S.; Vieira, R.: RelP: Portuguese open relation extraction (2017) 0.00
    0.0020106873 = product of:
      0.0040213745 = sum of:
        0.0040213745 = product of:
          0.008042749 = sum of:
            0.008042749 = weight(_text_:a in 3621) [ClassicSimilarity], result of:
              0.008042749 = score(doc=3621,freq=14.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1685276 = fieldWeight in 3621, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3621)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Natural language texts are valuable data sources in many human activities. NLP techniques are being widely used in order to help find the right information to specific needs. In this paper, we present one such technique: relation extraction from texts. This task aims at identifying and classifying semantic relations that occur between entities in a text. For example, the sentence "Roberto Marinho is the founder of Rede Globo" expresses a relation occurring between "Roberto Marinho" and "Rede Globo." This work presents a system for Portuguese Open Relation Extraction, named RelP, which extracts any relation descriptor that describes an explicit relation between named entities in the organisation domain by applying the Conditional Random Fields. For implementing RelP, we define the representation scheme, features based on previous work, and a reference corpus. RelP achieved state of the art results for open relation extraction; the F-measure rate was around 60% between the named entities person, organisation and place. For better understanding of the output, we present a way for organizing the output from the mining of the extracted relation descriptors. This organization can be useful to classify relation types, to cluster the entities involved in a common relation and to populate datasets.
    Type
    a
  10. Karlova-Bourbonus, N.: Automatic detection of contradictions in texts (2018) 0.00
    0.0018800576 = product of:
      0.0037601152 = sum of:
        0.0037601152 = product of:
          0.0075202305 = sum of:
            0.0075202305 = weight(_text_:a in 5976) [ClassicSimilarity], result of:
              0.0075202305 = score(doc=5976,freq=34.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15757877 = fieldWeight in 5976, product of:
                  5.8309517 = tf(freq=34.0), with freq of:
                    34.0 = termFreq=34.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0234375 = fieldNorm(doc=5976)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Natural language contradictions are of complex nature. As will be shown in Chapter 5, the realization of contradictions is not limited to the examples such as Socrates is a man and Socrates is not a man (under the condition that Socrates refers to the same object in the real world), which is discussed by Aristotle (Section 3.1.1). Empirical evidence (see Chapter 5 for more details) shows that only a few contradictions occurring in the real life are of that explicit (prototypical) kind. Rather, con-tradictions make use of a variety of natural language devices such as, e.g., paraphrasing, synonyms and antonyms, passive and active voice, diversity of negation expression, and figurative linguistic means such as idioms, irony, and metaphors. Additionally, the most so-phisticated kind of contradictions, the so-called implicit contradictions, can be found only when applying world knowledge and after conducting a sequence of logical operations such as e.g. in: (1.1) The first prize was given to the experienced grandmaster L. Stein who, in total, col-lected ten points (7 wins and 3 draws). Those familiar with the chess rules know that a chess player gets one point for winning and zero points for losing the game. In case of a draw, each player gets a half point. Built on this idea and by conducting some simple mathematical operations, we can infer that in the case of 7 wins and 3 draws (the second part of the sentence), a player can only collect 8.5 points and not 10 points. Hence, we observe that there is a contradiction between the first and the second parts of the sentence.
    Implicit contradictions will only partially be the subject of the present study, aiming primarily at identifying the realization mechanism and cues (Chapter 5) as well as finding the parts of contradictions by applying the state of the art algorithms for natural language processing without conducting deep meaning processing. Further in focus are the explicit and implicit contradictions that can be detected by means of explicit linguistic, structural, lexical cues, and by conducting some additional processing operations (e.g., counting the sum in order to detect contradictions arising from numerical divergencies). One should note that an additional complexity in finding contradictions can arise in case parts of the contradictions occur on different levels of realization. Thus, a contradiction can be observed on the word- and phrase-level, such as in a married bachelor (for variations of contradictions on lexical level, see Ganeev 2004), on the sentence level - between parts of a sentence or between two or more sentences, or on the text level - between the portions of a text or between the whole texts such as a contradiction between the Bible and the Quran, for example. Only contradictions arising at the level of single sentences occurring in one or more texts, as well as parts of a sentence, will be considered for the purpose of this study. Though the focus of interest will be on single sentences, it will make use of text particularities such as coreference resolution without establishing the referents in the real world. Finally, another aspect to be considered is that parts of the contradictions are not neces-sarily to appear at the same time. They can be separated by many years and centuries with or without time expression making their recognition by human and detection by machine challenging. According to Aristotle's ontological version of the LNC (Section 3.1.1), how-ever, the same time reference is required in order for two statements to be judged as a contradiction. Taking this into account, we set the borders for the study by limiting the ana-lyzed textual data thematically (only nine world events) and temporally (three days after the reported event had happened) (Section 5.1). No sophisticated time processing will thus be conducted.
  11. Fegley, B.D.; Torvik, V.I.: On the role of poetic versus nonpoetic features in "kindred" and diachronic poetry attribution (2012) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 488) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=488,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 488, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=488)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Author attribution studies have demonstrated remarkable success in applying orthographic and lexicographic features of text in a variety of discrimination problems. What might poetic features, such as syllabic stress and mood, contribute? We address this question in the context of two different attribution problems: (a) kindred: differentiate Langston Hughes' early poems from those of kindred poets and (b) diachronic: differentiate Hughes' early from his later poems. Using a diverse set of 535 generic text features, each categorized as poetic or nonpoetic, correlation-based greedy forward search ranked the features and a support vector machine classified the poems. A small subset of features (~10) achieved cross-validated precision and recall as high as 87%. Poetic features (rhyme patterns particularly) were nearly as effective as nonpoetic in kindred discrimination, but less effective diachronically. In other words, Hughes used both poetic and nonpoetic features in distinctive ways and his use of nonpoetic features evolved systematically while he continued to experiment with poetic features. These findings affirm qualitative studies attesting to structural elements from Black oral tradition and Black folk music (blues) and to the internal consistency of Hughes' early poetry.
    Type
    a
  12. Hoenkamp, E.; Bruza, P.: How everyday language can and will boost effective information retrieval (2015) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 2123) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=2123,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 2123, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2123)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Typing 2 or 3 keywords into a browser has become an easy and efficient way to find information. Yet, typing even short queries becomes tedious on ever shrinking (virtual) keyboards. Meanwhile, speech processing is maturing rapidly, facilitating everyday language input. Also, wearable technology can inform users proactively by listening in on their conversations or processing their social media interactions. Given these developments, everyday language may soon become the new input of choice. We present an information retrieval (IR) algorithm specifically designed to accept everyday language. It integrates two paradigms of information retrieval, previously studied in isolation; one directed mainly at the surface structure of language, the other primarily at the underlying meaning. The integration was achieved by a Markov machine that encodes meaning by its transition graph, and surface structure by the language it generates. A rigorous evaluation of the approach showed, first, that it can compete with the quality of existing language models, second, that it is more effective the more verbose the input, and third, as a consequence, that it is promising for an imminent transition from keyword input, where the onus is on the user to formulate concise queries, to a modality where users can express more freely, more informal, and more natural their need for information in everyday language.
    Type
    a
  13. Brychcín, T.; Konopík, M.: HPS: High precision stemmer (2015) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 2686) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=2686,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 2686, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2686)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
    Type
    a
  14. Sankarasubramaniam, Y.; Ramanathan, K.; Ghosh, S.: Text summarization using Wikipedia (2014) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 2693) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=2693,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 2693, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2693)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Automatic text summarization has been an active field of research for many years. Several approaches have been proposed, ranging from simple position and word-frequency methods, to learning and graph based algorithms. The advent of human-generated knowledge bases like Wikipedia offer a further possibility in text summarization - they can be used to understand the input text in terms of salient concepts from the knowledge base. In this paper, we study a novel approach that leverages Wikipedia in conjunction with graph-based ranking. Our approach is to first construct a bipartite sentence-concept graph, and then rank the input sentences using iterative updates on this graph. We consider several models for the bipartite graph, and derive convergence properties under each model. Then, we take up personalized and query-focused summarization, where the sentence ranks additionally depend on user interests and queries, respectively. Finally, we present a Wikipedia-based multi-document summarization algorithm. An important feature of the proposed algorithms is that they enable real-time incremental summarization - users can first view an initial summary, and then request additional content if interested. We evaluate the performance of our proposed summarizer using the ROUGE metric, and the results show that leveraging Wikipedia can significantly improve summary quality. We also present results from a user study, which suggests that using incremental summarization can help in better understanding news articles.
    Type
    a
  15. Lhadj, L.S.; Boughanem, M.; Amrouche, K.: Enhancing information retrieval through concept-based language modeling and semantic smoothing (2016) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 3221) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=3221,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 3221, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3221)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Traditionally, many information retrieval models assume that terms occur in documents independently. Although these models have already shown good performance, the word independency assumption seems to be unrealistic from a natural language point of view, which considers that terms are related to each other. Therefore, such an assumption leads to two well-known problems in information retrieval (IR), namely, polysemy, or term mismatch, and synonymy. In language models, these issues have been addressed by considering dependencies such as bigrams, phrasal-concepts, or word relationships, but such models are estimated using simple n-grams or concept counting. In this paper, we address polysemy and synonymy mismatch with a concept-based language modeling approach that combines ontological concepts from external resources with frequently found collocations from the document collection. In addition, the concept-based model is enriched with subconcepts and semantic relationships through a semantic smoothing technique so as to perform semantic matching. Experiments carried out on TREC collections show that our model achieves significant improvements over a single word-based model and the Markov Random Field model (using a Markov classifier).
    Type
    a
  16. Gill, A.J.; Hinrichs-Krapels, S.; Blanke, T.; Grant, J.; Hedges, M.; Tanner, S.: Insight workflow : systematically combining human and computational methods to explore textual data (2017) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 3682) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=3682,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 3682, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3682)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Analyzing large quantities of real-world textual data has the potential to provide new insights for researchers. However, such data present challenges for both human and computational methods, requiring a diverse range of specialist skills, often shared across a number of individuals. In this paper we use the analysis of a real-world data set as our case study, and use this exploration as a demonstration of our "insight workflow," which we present for use and adaptation by other researchers. The data we use are impact case study documents collected as part of the UK Research Excellence Framework (REF), consisting of 6,679 documents and 6.25 million words; the analysis was commissioned by the Higher Education Funding Council for England (published as report HEFCE 2015). In our exploration and analysis we used a variety of techniques, ranging from keyword in context and frequency information to more sophisticated methods (topic modeling), with these automated techniques providing an empirical point of entry for in-depth and intensive human analysis. We present the 60 topics to demonstrate the output of our methods, and illustrate how the variety of analysis techniques can be combined to provide insights. We note potential limitations and propose future work.
    Type
    a
  17. Çelebi, A.; Özgür, A.: Segmenting hashtags and analyzing their grammatical structure (2018) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 4221) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=4221,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 4221, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4221)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Originated as a label to mark specific tweets, hashtags are increasingly used to convey messages that people like to see in the trending hashtags list. Complex noun phrases and even sentences can be turned into a hashtag. Breaking hashtags into their words is a challenging task due to the irregular and compact nature of the language used in Twitter. In this study, we investigate feature-based machine learning and language model (LM)-based approaches for hashtag segmentation. Our results show that LM alone is not successful at segmenting nontrivial hashtags. However, when the N-best LM-based segmentations are incorporated as features into the feature-based approach, along with context-based features proposed in this study, state-of-the-art results in hashtag segmentation are achieved. In addition, we provide an analysis of over two million distinct hashtags, autosegmented by using our best configuration. The analysis reveals that half of all 60 million hashtag occurrences contain multiple words and 80% of sentiment is trapped inside multiword hashtags, justifying the need for hashtag segmentation. Furthermore, we analyze the grammatical structure of hashtags by parsing them and observe that 77% of the hashtags are noun-based, whereas 11.9% are verb-based.
    Type
    a
  18. Muneer, I.; Sharjeel, M.; Iqbal, M.; Adeel Nawab, R.M.; Rayson, P.: CLEU - A Cross-language english-urdu corpus and benchmark for text reuse experiments (2019) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 5299) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=5299,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 5299, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5299)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories that is near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
    Type
    a
  19. Chen, L.; Fang, H.: ¬An automatic method for ex-tracting innovative ideas based on the Scopus® database (2019) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 5310) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=5310,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 5310, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5310)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The novelty of knowledge claims in a research paper can be considered an evaluation criterion for papers to supplement citations. To provide a foundation for research evaluation from the perspective of innovativeness, we propose an automatic approach for extracting innovative ideas from the abstracts of technology and engineering papers. The approach extracts N-grams as candidates based on part-of-speech tagging and determines whether they are novel by checking the Scopus® database to determine whether they had ever been presented previously. Moreover, we discussed the distributions of innovative ideas in different abstract structures. To improve the performance by excluding noisy N-grams, a list of stopwords and a list of research description characteristics were developed. We selected abstracts of articles published from 2011 to 2017 with the topic of semantic analysis as the experimental texts. Excluding noisy N-grams, considering the distribution of innovative ideas in abstracts, and suitably combining N-grams can effectively improve the performance of automatic innovative idea extraction. Unlike co-word and co-citation analysis, innovative-idea extraction aims to identify the differences in a paper from all previously published papers.
    Type
    a
  20. Stoykova, V.; Petkova, E.: Automatic extraction of mathematical terms for precalculus (2012) 0.00
    0.0018428253 = product of:
      0.0036856506 = sum of:
        0.0036856506 = product of:
          0.0073713013 = sum of:
            0.0073713013 = weight(_text_:a in 156) [ClassicSimilarity], result of:
              0.0073713013 = score(doc=156,freq=6.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.1544581 = fieldWeight in 156, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=156)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In this work, we present the results of research for evaluating a methodology for extracting mathematical terms for precalculus using the techniques for semantically-oriented statistical search. We use the corpus-based approach and the combination of different statistically-based techniques for extracting keywords, collocations and co-occurrences incorporated in the Sketch Engine software. We evaluate the collocations candidate terms for the basic concept function(s) and approve the related methodology by precalculus domain conceptual terms definitions. Finally, we offer a conceptual terms hierarchical representation and discuss the results with respect to their possible applications.
    Type
    a

Languages

  • e 95
  • d 26
  • el 1
  • More… Less…

Types

  • a 106
  • el 25
  • m 5
  • x 5
  • s 2
  • More… Less…