Search (79 results, page 1 of 4)

  • × theme_ss:"Computerlinguistik"
  • × type_ss:"a"
  • × year_i:[2010 TO 2020}
  1. Keselman, A.; Rosemblat, G.; Kilicoglu, H.; Fiszman, M.; Jin, H.; Shin, D.; Rindflesch, T.C.: Adapting semantic natural language processing technology to address information overload in influenza epidemic management (2010) 0.03
    0.031973958 = product of:
      0.047960933 = sum of:
        0.02495818 = weight(_text_:information in 1312) [ClassicSimilarity], result of:
          0.02495818 = score(doc=1312,freq=16.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.27429342 = fieldWeight in 1312, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1312)
        0.023002753 = product of:
          0.046005506 = sum of:
            0.046005506 = weight(_text_:management in 1312) [ClassicSimilarity], result of:
              0.046005506 = score(doc=1312,freq=4.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.2633291 = fieldWeight in 1312, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1312)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The explosion of disaster health information results in information overload among response professionals. The objective of this project was to determine the feasibility of applying semantic natural language processing (NLP) technology to addressing this overload. The project characterizes concepts and relationships commonly used in disaster health-related documents on influenza pandemics, as the basis for adapting an existing semantic summarizer to the domain. Methods include human review and semantic NLP analysis of a set of relevant documents. This is followed by a pilot test in which two information specialists use the adapted application for a realistic information-seeking task. According to the results, the ontology of influenza epidemics management can be described via a manageable number of semantic relationships that involve concepts from a limited number of semantic types. Test users demonstrate several ways to engage with the application to obtain useful information. This suggests that existing semantic NLP algorithms can be adapted to support information summarization and visualization in influenza epidemics and other disaster health areas. However, additional research is needed in the areas of terminology development (as many relevant relationships and terms are not part of existing standardized vocabularies), NLP, and user interface design.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.12, S.2531-2543
  2. Lu, K.; Cai, X.; Ajiferuke, I.; Wolfram, D.: Vocabulary size and its effect on topic representation (2017) 0.03
    0.025239285 = product of:
      0.037858926 = sum of:
        0.018340444 = weight(_text_:information in 3414) [ClassicSimilarity], result of:
          0.018340444 = score(doc=3414,freq=6.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.20156369 = fieldWeight in 3414, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=3414)
        0.019518482 = product of:
          0.039036963 = sum of:
            0.039036963 = weight(_text_:management in 3414) [ClassicSimilarity], result of:
              0.039036963 = score(doc=3414,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.22344214 = fieldWeight in 3414, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3414)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
    Source
    Information processing and management. 53(2017) no.3, S.653-665
  3. AL-Smadi, M.; Jaradat, Z.; AL-Ayyoub, M.; Jararweh, Y.: Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features (2017) 0.03
    0.025239285 = product of:
      0.037858926 = sum of:
        0.018340444 = weight(_text_:information in 5095) [ClassicSimilarity], result of:
          0.018340444 = score(doc=5095,freq=6.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.20156369 = fieldWeight in 5095, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=5095)
        0.019518482 = product of:
          0.039036963 = sum of:
            0.039036963 = weight(_text_:management in 5095) [ClassicSimilarity], result of:
              0.039036963 = score(doc=5095,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.22344214 = fieldWeight in 5095, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5095)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The rapid growth in digital information has raised considerable challenges, in particular when it comes to automated content analysis. Social media such as Twitter share a lot of their users' information about their events, opinions, personalities, etc. Paraphrase Identification (PI) is concerned with recognizing whether two texts have the same or a similar meaning, whereas Semantic Text Similarity (STS) is concerned with the degree of that similarity. This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets. The approach adopts several phases of text processing, feature extraction and text classification. Lexical, syntactic, and semantic features are extracted to overcome the weaknesses and limitations of the current technologies in solving these tasks for the Arabic language. Maximum Entropy (MaxEnt) and Support Vector Regression (SVR) classifiers are trained using these features and are evaluated using a dataset prepared for this research. The experimental results show that the approach achieves good results in comparison to the baseline results.
    Source
    Information processing and management. 53(2017) no.3, S.640-652
  4. Colace, F.; Santo, M. De; Greco, L.; Napoletano, P.: Weighted word pairs for query expansion (2015) 0.02
    0.023416823 = product of:
      0.035125233 = sum of:
        0.01235367 = weight(_text_:information in 2687) [ClassicSimilarity], result of:
          0.01235367 = score(doc=2687,freq=2.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.13576832 = fieldWeight in 2687, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2687)
        0.022771563 = product of:
          0.045543127 = sum of:
            0.045543127 = weight(_text_:management in 2687) [ClassicSimilarity], result of:
              0.045543127 = score(doc=2687,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.2606825 = fieldWeight in 2687, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2687)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Source
    Information processing and management. 51(2015) no.1, S.179-193
  5. Fernández, R.T.; Losada, D.E.: Effective sentence retrieval based on query-independent evidence (2012) 0.02
    0.022995595 = product of:
      0.03449339 = sum of:
        0.014974909 = weight(_text_:information in 2728) [ClassicSimilarity], result of:
          0.014974909 = score(doc=2728,freq=4.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.16457605 = fieldWeight in 2728, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2728)
        0.019518482 = product of:
          0.039036963 = sum of:
            0.039036963 = weight(_text_:management in 2728) [ClassicSimilarity], result of:
              0.039036963 = score(doc=2728,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.22344214 = fieldWeight in 2728, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2728)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    In this paper we propose an effective sentence retrieval method that consists of incorporating query-independent features into standard sentence retrieval models. To meet this aim, we apply a formal methodology and consider different query-independent features. In particular, we show that opinion-based features are promising. Opinion mining is an increasingly important research topic but little is known about how to improve retrieval algorithms with opinion-based components. In this respect, we consider here different kinds of opinion-based features to act as query-independent evidence and study whether this incorporation improves retrieval performance. On the other hand, information needs are usually related to people, locations or organizations. We hypothesize here that using these named entities as query-independent features may also improve the sentence relevance estimation. Finally, the length of the retrieval unit has been shown to be an important component in different retrieval scenarios. We therefore include length-based features in our study. Our evaluation demonstrates that, either in isolation or in combination, these query-independent features help to improve substantially the performance of state-of-the-art sentence retrieval methods.
    Source
    Information processing and management. 48(2012) no.6, S.1203-1229
  6. Clark, M.; Kim, Y.; Kruschwitz, U.; Song, D.; Albakour, D.; Dignum, S.; Beresi, U.C.; Fasli, M.; Roeck, A. De: Automatically structuring domain knowledge from text : an overview of current research (2012) 0.02
    0.022995595 = product of:
      0.03449339 = sum of:
        0.014974909 = weight(_text_:information in 2738) [ClassicSimilarity], result of:
          0.014974909 = score(doc=2738,freq=4.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.16457605 = fieldWeight in 2738, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2738)
        0.019518482 = product of:
          0.039036963 = sum of:
            0.039036963 = weight(_text_:management in 2738) [ClassicSimilarity], result of:
              0.039036963 = score(doc=2738,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.22344214 = fieldWeight in 2738, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2738)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    This paper presents an overview of automatic methods for building domain knowledge structures (domain models) from text collections. Applications of domain models have a long history within knowledge engineering and artificial intelligence. In the last couple of decades they have surfaced noticeably as a useful tool within natural language processing, information retrieval and semantic web technology. Inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines, we give an overview of the current research landscape and some techniques and approaches. We will also discuss trade-offs between different approaches and point to some recent trends.
    Source
    Information processing and management. 48(2012) no.3, S.552-568
  7. Agarwal, B.; Ramampiaro, H.; Langseth, H.; Ruocco, M.: A deep network model for paraphrase detection in short text messages (2018) 0.02
    0.022609001 = product of:
      0.0339135 = sum of:
        0.017648099 = weight(_text_:information in 5043) [ClassicSimilarity], result of:
          0.017648099 = score(doc=5043,freq=8.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.19395474 = fieldWeight in 5043, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5043)
        0.016265402 = product of:
          0.032530803 = sum of:
            0.032530803 = weight(_text_:management in 5043) [ClassicSimilarity], result of:
              0.032530803 = score(doc=5043,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.18620178 = fieldWeight in 5043, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5043)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as those on Twitter, which often contain language irregularity and noise, and do not necessarily contain as much semantic information as longer clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN) model, combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which makes it possible to create an informative semantic representation of each sentence by (1) using a CNN to extract the local region information in the form of important n-grams from the sentence, and (2) applying an RNN to capture the long-term dependency information. In addition, we perform a comparative study on state-of-the-art approaches within paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation has shown that the proposed DeepParaphrase-based approach achieves good results on both types of texts, thus making it more robust and generic than the existing approaches.
    Source
    Information processing and management. 54(2018) no.6, S.922-937
  8. Lawrie, D.; Mayfield, J.; McNamee, P.; Oard, D.W.: Cross-language person-entity linking from 20 languages (2015) 0.02
    0.021104416 = product of:
      0.031656623 = sum of:
        0.01058886 = weight(_text_:information in 1848) [ClassicSimilarity], result of:
          0.01058886 = score(doc=1848,freq=2.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.116372846 = fieldWeight in 1848, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1848)
        0.021067765 = product of:
          0.04213553 = sum of:
            0.04213553 = weight(_text_:22 in 1848) [ClassicSimilarity], result of:
              0.04213553 = score(doc=1848,freq=2.0), product of:
                0.18150859 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0518325 = queryNorm
                0.23214069 = fieldWeight in 1848, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1848)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The goal of entity linking is to associate references to an entity that is found in unstructured natural language content to an authoritative inventory of known entities. This article describes the construction of 6 test collections for cross-language person-entity linking that together span 22 languages. Fully automated components were used together with 2 crowdsourced validation stages to affordably generate ground-truth annotations with an accuracy comparable to that of a completely manual process. The resulting test collections each contain between 642 (Arabic) and 2,361 (Romanian) person references in non-English texts for which the correct resolution in English Wikipedia is known, plus a similar number of references for which no correct resolution into English Wikipedia is believed to exist. Fully automated cross-language person-name linking experiments with 20 non-English languages yielded a resolution accuracy of between 0.84 (Serbian) and 0.98 (Romanian), which compares favorably with previously reported cross-language entity linking results for Spanish.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.6, S.1106-1123
  9. Brychcín, T.; Konopík, M.: HPS: High precision stemmer (2015) 0.02
    0.021032736 = product of:
      0.031549104 = sum of:
        0.015283704 = weight(_text_:information in 2686) [ClassicSimilarity], result of:
          0.015283704 = score(doc=2686,freq=6.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.16796975 = fieldWeight in 2686, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2686)
        0.016265402 = product of:
          0.032530803 = sum of:
            0.032530803 = weight(_text_:management in 2686) [ClassicSimilarity], result of:
              0.032530803 = score(doc=2686,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.18620178 = fieldWeight in 2686, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2686)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
    Source
    Information processing and management. 51(2015) no.1, S.68-91
  10. Doko, A.; Stula, M.; Seric, L.: Improved sentence retrieval using local context and sentence length (2013) 0.02
    0.020071562 = product of:
      0.030107342 = sum of:
        0.01058886 = weight(_text_:information in 2705) [ClassicSimilarity], result of:
          0.01058886 = score(doc=2705,freq=2.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.116372846 = fieldWeight in 2705, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2705)
        0.019518482 = product of:
          0.039036963 = sum of:
            0.039036963 = weight(_text_:management in 2705) [ClassicSimilarity], result of:
              0.039036963 = score(doc=2705,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.22344214 = fieldWeight in 2705, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2705)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Source
    Information processing and management. 49(2013) no.6, S.1301-1312
  11. K., Vani; Gupta, D.: Unmasking text plagiarism using syntactic-semantic based natural language processing techniques : comparisons, analysis and challenges (2018) 0.02
    0.019162996 = product of:
      0.028744493 = sum of:
        0.01247909 = weight(_text_:information in 5084) [ClassicSimilarity], result of:
          0.01247909 = score(doc=5084,freq=4.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.13714671 = fieldWeight in 5084, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5084)
        0.016265402 = product of:
          0.032530803 = sum of:
            0.032530803 = weight(_text_:management in 5084) [ClassicSimilarity], result of:
              0.032530803 = score(doc=5084,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.18620178 = fieldWeight in 5084, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5084)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles, in detecting plagiarized fragments and utilizes a combined syntactic-semantic similarity metric, which extracts the semantic concepts from the WordNet lexical database. The linguistic information is utilized for effective pre-processing and for obtaining semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels upon the features extracted is analyzed and discussed. Further, unlike the existing systems, which were evaluated on some limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by the PAN competition from 2009 to 2014. The approach showed considerable improvement in comparison with the top-ranked systems of the respective years. The evaluation and analysis with various cases of plagiarism also reflected the superiority of deeper linguistic features for identifying manually plagiarized data.
    Source
    Information processing and management. 54(2018) no.3, S.408-432
  12. Belbachir, F.; Boughanem, M.: Using language models to improve opinion detection (2018) 0.02
    0.0180872 = product of:
      0.027130801 = sum of:
        0.01411848 = weight(_text_:information in 5044) [ClassicSimilarity], result of:
          0.01411848 = score(doc=5044,freq=8.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.1551638 = fieldWeight in 5044, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.03125 = fieldNorm(doc=5044)
        0.013012322 = product of:
          0.026024643 = sum of:
            0.026024643 = weight(_text_:management in 5044) [ClassicSimilarity], result of:
              0.026024643 = score(doc=5044,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.14896142 = fieldWeight in 5044, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5044)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Abstract
    Opinion mining is one of the most important research tasks in the information retrieval research community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, which itself focuses on detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information. We hypothesize that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analysis document and reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries. However, in our study, we modify these language models to represent opinionated documents. We carry out several experiments using Text REtrieval Conference (TREC) Blogs 06 as our analysis collection and Internet Movie Database (IMDB), Multi-Perspective Question Answering (MPQA) and CHESLY as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection alongside different combinations of opinion and retrieval scores. We then use this data to deduce the best opinion detection models. Using the best models, our approach improves on the best baseline of TREC Blog (baseline4) by 30%.
    Source
    Information processing and management. 54(2018) no.6, S.958-968
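
The abstract above names Dirichlet and Jelinek-Mercer language models but does not give the authors' modified formulas. As a rough illustration only, the following sketch shows the standard textbook forms of the two smoothing methods. The function names, the toy data, and the parameter defaults are assumptions made for this example and do not come from the paper.

```python
from collections import Counter


def jelinek_mercer(term, doc_tokens, coll_prob, lam=0.5):
    """Standard Jelinek-Mercer smoothing (hypothetical helper).

    p(w|d) = (1 - lam) * c(w, d) / |d| + lam * p(w|C)
    Here lam is the weight given to the collection model.
    """
    counts = Counter(doc_tokens)
    p_doc = counts[term] / len(doc_tokens) if doc_tokens else 0.0
    return (1.0 - lam) * p_doc + lam * coll_prob.get(term, 0.0)


def dirichlet(term, doc_tokens, coll_prob, mu=2000.0):
    """Standard Dirichlet-prior smoothing (hypothetical helper).

    p(w|d) = (c(w, d) + mu * p(w|C)) / (|d| + mu)
    """
    counts = Counter(doc_tokens)
    return (counts[term] + mu * coll_prob.get(term, 0.0)) / (len(doc_tokens) + mu)


# Toy data (assumed): a three-token document and a collection model that
# would, in the setting of the abstract, be estimated from a reference
# collection of opinionated text.
doc = ["great", "movie", "great"]
collection_model = {"great": 0.1, "movie": 0.2}

p_jm = jelinek_mercer("great", doc, collection_model, lam=0.5)
p_dir = dirichlet("great", doc, collection_model, mu=3.0)
```

Both functions interpolate between the document's own term frequencies and a background model; how the paper adapts that background model to opinionated sources is not stated in the abstract.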
  13. Gencosman, B.C.; Ozmutlu, H.C.; Ozmutlu, S.: Character n-gram application for automatic new topic identification (2014) 0.02
    0.0167263 = product of:
      0.02508945 = sum of:
        0.0088240495 = weight(_text_:information in 2688) [ClassicSimilarity], result of:
          0.0088240495 = score(doc=2688,freq=2.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.09697737 = fieldWeight in 2688, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2688)
        0.016265402 = product of:
          0.032530803 = sum of:
            0.032530803 = weight(_text_:management in 2688) [ClassicSimilarity], result of:
              0.032530803 = score(doc=2688,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.18620178 = fieldWeight in 2688, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2688)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Source
    Information processing and management. 50(2014) no.6, S.821-856
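
The score trees in this listing can be checked by hand. The following sketch recomputes the score of entry 13 (doc=2688) from the numbers shown in its tree, assuming the classic Lucene formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which appear to match the listed values. The function names are hypothetical, and queryNorm and fieldNorm are taken from the listing rather than derived.

```python
import math


def classic_tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    # Inverse document frequency in the form that reproduces the listed values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in the explain tree.
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


# Values copied from the tree for doc=2688.
QUERY_NORM = 0.0518325
FIELD_NORM = 0.0390625
MAX_DOCS = 44218

info = term_score(2.0, 20772, MAX_DOCS, QUERY_NORM, FIELD_NORM)
mgmt = term_score(2.0, 4130, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# The "management" clause sits in a nested query with coord(1/2);
# the outer query applies coord(2/3) to the sum of both clauses.
total = (info + 0.5 * mgmt) * (2.0 / 3.0)
```

With these inputs, `total` should land close to the listed 0.0167263. Small differences in the last digits are expected because Lucene computes these values in single precision.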
  14. Sankarasubramaniam, Y.; Ramanathan, K.; Ghosh, S.: Text summarization using Wikipedia (2014) 0.02
    0.0167263 = product of:
      0.02508945 = sum of:
        0.0088240495 = weight(_text_:information in 2693) [ClassicSimilarity], result of:
          0.0088240495 = score(doc=2693,freq=2.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.09697737 = fieldWeight in 2693, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2693)
        0.016265402 = product of:
          0.032530803 = sum of:
            0.032530803 = weight(_text_:management in 2693) [ClassicSimilarity], result of:
              0.032530803 = score(doc=2693,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.18620178 = fieldWeight in 2693, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2693)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Source
    Information processing and management. 50(2014) no.3, S.443-461
  15. Fang, L.; Tuan, L.A.; Hui, S.C.; Wu, L.: Syntactic based approach for grammar question retrieval (2018) 0.02
    0.0167263 = product of:
      0.02508945 = sum of:
        0.0088240495 = weight(_text_:information in 5086) [ClassicSimilarity], result of:
          0.0088240495 = score(doc=5086,freq=2.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.09697737 = fieldWeight in 5086, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5086)
        0.016265402 = product of:
          0.032530803 = sum of:
            0.032530803 = weight(_text_:management in 5086) [ClassicSimilarity], result of:
              0.032530803 = score(doc=5086,freq=2.0), product of:
                0.17470726 = queryWeight, product of:
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0518325 = queryNorm
                0.18620178 = fieldWeight in 5086, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.3706124 = idf(docFreq=4130, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5086)
          0.5 = coord(1/2)
      0.6666667 = coord(2/3)
    
    Source
    Information processing and management. 54(2018) no.2, S.184-202
  16. Engerer, V.: Exploring interdisciplinary relationships between linguistics and information retrieval from the 1960s to today (2017) 0.01
    0.011161638 = product of:
      0.033484913 = sum of:
        0.033484913 = weight(_text_:information in 3434) [ClassicSimilarity], result of:
          0.033484913 = score(doc=3434,freq=20.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.36800325 = fieldWeight in 3434, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=3434)
      0.33333334 = coord(1/3)
    
    Abstract
    This article explores how linguistics has influenced information retrieval (IR) and attempts to explain the impact of linguistics through an analysis of internal developments in information science generally, and IR in particular. It notes that information science/IR has been evolving from a case science into a fully fledged, "disciplined"/disciplinary science. The article establishes correspondences between linguistics and information science/IR using the three established IR paradigms (physical, cognitive, and computational) as a frame of reference. The current relationship between information science/IR and linguistics is elucidated through discussion of some recent information science publications dealing with linguistic topics, and a novel technique, "keyword collocation analysis," is introduced. Insights from interdisciplinarity research and case theory are also discussed. It is demonstrated that the three stages of interdisciplinarity, namely multidisciplinarity, interdisciplinarity (in the narrow sense), and transdisciplinarity, can be linked to different phases of the information science/IR-linguistics relationship and connected to different ways of using linguistic theory in information science and IR.
    Source
    Journal of the Association for Information Science and Technology. 68(2017) no.3, S.660-680
  17. Ko, Y.: A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.01
    0.009983273 = product of:
      0.029949818 = sum of:
        0.029949818 = weight(_text_:information in 2339) [ClassicSimilarity], result of:
          0.029949818 = score(doc=2339,freq=16.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.3291521 = fieldWeight in 2339, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
      0.33333334 = coord(1/3)
    
    Abstract
    Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2553-2565
  18. Kocijan, K.: Visualizing natural language resources (2015) 0.01
    0.008319394 = product of:
      0.02495818 = sum of:
        0.02495818 = weight(_text_:information in 2995) [ClassicSimilarity], result of:
          0.02495818 = score(doc=2995,freq=4.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.27429342 = fieldWeight in 2995, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.078125 = fieldNorm(doc=2995)
      0.33333334 = coord(1/3)
    
    Source
    Re:inventing information science in the networked society: Proceedings of the 14th International Symposium on Information Science, Zadar/Croatia, 19th-21st May 2015. Eds.: F. Pehar, C. Schloegl u. C. Wolff
  19. Babik, W.: Keywords as linguistic tools in information and knowledge organization (2017) 0.01
    0.0082357805 = product of:
      0.02470734 = sum of:
        0.02470734 = weight(_text_:information in 3510) [ClassicSimilarity], result of:
          0.02470734 = score(doc=3510,freq=8.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.27153665 = fieldWeight in 3510, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3510)
      0.33333334 = coord(1/3)
    
    Source
    Theorie, Semantik und Organisation von Wissen: Proceedings der 13. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und dem 13. Internationalen Symposium der Informationswissenschaft der Higher Education Association for Information Science (HI) Potsdam (19.-20.03.2013): 'Theory, Information and Organization of Knowledge' / Proceedings der 14. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) und Natural Language & Information Systems (NLDB) Passau (16.06.2015): 'Lexical Resources for Knowledge Organization' / Proceedings des Workshops der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) auf der SEMANTICS Leipzig (1.09.2014): 'Knowledge Organization and Semantic Web' / Proceedings des Workshops der Polnischen und Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation (ISKO) Cottbus (29.-30.09.2011): 'Economics of Knowledge Production and Organization'. Hrsg. von W. Babik, H.P. Ohly u. K. Weber
  20. Rosemblat, G.; Resnick, M.P.; Auston, I.; Shin, D.; Sneiderman, C.; Fiszman, M.; Rindflesch, T.C.: Extending SemRep to the public health domain (2013) 0.01
    0.007892471 = product of:
      0.02367741 = sum of:
        0.02367741 = weight(_text_:information in 2096) [ClassicSimilarity], result of:
          0.02367741 = score(doc=2096,freq=10.0), product of:
            0.09099081 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0518325 = queryNorm
            0.2602176 = fieldWeight in 2096, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2096)
      0.33333334 = coord(1/3)
    
    Abstract
    We describe the use of a domain-independent method to extend a natural language processing (NLP) application, SemRep (Rindflesch, Fiszman, & Libbus, 2005), based on the knowledge sources afforded by the Unified Medical Language System (UMLS®; Humphreys, Lindberg, Schoolman, & Barnett, 1998) to support the area of health promotion within the public health domain. Public health professionals require good information about successful health promotion policies and programs that might be considered for application within their own communities. Our effort seeks to improve access to relevant information for the public health profession, to help those in the field remain an information-savvy workforce. Natural language processing and semantic techniques hold promise to help public health professionals navigate the growing ocean of information by organizing and structuring this knowledge into a focused public health framework paired with a user-friendly visualization application as a way to summarize results of PubMed® searches in this field of knowledge.
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.10, S.1963-1974

Languages

  • e 70
  • d 9