Search (266 results, page 2 of 14)

  • × language_ss:"e"
  • × theme_ss:"Computerlinguistik"
  1. Rahmstorf, G.: Information retrieval using conceptual representations of phrases (1994) 0.03
    0.03466491 = product of:
      0.06932982 = sum of:
        0.043894395 = weight(_text_:data in 7862) [ClassicSimilarity], result of:
          0.043894395 = score(doc=7862,freq=4.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.29644224 = fieldWeight in 7862, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=7862)
        0.025435425 = product of:
          0.05087085 = sum of:
            0.05087085 = weight(_text_:processing in 7862) [ClassicSimilarity], result of:
              0.05087085 = score(doc=7862,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.26835677 = fieldWeight in 7862, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046875 = fieldNorm(doc=7862)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    The information retrieval problem is described starting from an analysis of the concepts 'user's information request' and 'information offerings of texts'. It is shown that natural language phrases are a more adequate medium for expressing information requests and information offerings than character string based query and indexing languages complemented by Boolean operators. The phrases must be represented as concepts to reach a language invariant level for rule based relevance analysis. The special type of representation called advanced thesaurus is used for the semantic representation of natural language phrases and for relevance processing. The analysis of the retrieval problem leads to a symmetric system structure
    Series
    Studies in classification, data analysis, and knowledge organization
    Source
    Information systems and data analysis: prospects - foundations - applications. Proc. of the 17th Annual Conference of the Gesellschaft für Klassifikation, Kaiserslautern, March 3-5, 1993. Ed.: H.-H. Bock et al
  2. Ingenerf, J.: Disambiguating lexical meaning : conceptual meta-modelling as a means of controlling semantic language analysis (1994) 0.03
    0.03466491 = product of:
      0.06932982 = sum of:
        0.043894395 = weight(_text_:data in 2572) [ClassicSimilarity], result of:
          0.043894395 = score(doc=2572,freq=4.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.29644224 = fieldWeight in 2572, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=2572)
        0.025435425 = product of:
          0.05087085 = sum of:
            0.05087085 = weight(_text_:processing in 2572) [ClassicSimilarity], result of:
              0.05087085 = score(doc=2572,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.26835677 = fieldWeight in 2572, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2572)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    A formal terminology consists of a set of conceptual definitions for the semantical reconstruction of a vocabulary on an intensional level of description. The marking of comparatively abstract concepts as semantic categories and their relational positioning on a meta-level is shown to be instrumental in adapting the conceptual design to domain-specific characteristics. Such a meta-model implies that concepts subsumed by categories may share their compositional possibilities as regards the construction of complex structures. Our approach to language processing leads to an automatic derivation of contextual semantic information about the linguistic expressions under review. This information is encoded by means of values of certain attributes defined in a feature-based grammatical framework. A standard process controlling grammatical analysis, the unification of feature structures, is used for its evaluation. One important example for the usefulness of this approach is the disambiguation of lexical meaning
    Series
    Studies in classification, data analysis, and knowledge organization
    Source
    Information systems and data analysis: prospects - foundations - applications. Proc. of the 17th Annual Conference of the Gesellschaft für Klassifikation, Kaiserslautern, March 3-5, 1993. Ed.: H.-H. Bock et al
  3. Driscoll, J.R.; Rajala, D.A.; Shaffer, W.H.: ¬The operation and performance of an artificially intelligent keywording system (1991) 0.03
    0.032942846 = product of:
      0.06588569 = sum of:
        0.036211025 = weight(_text_:data in 6681) [ClassicSimilarity], result of:
          0.036211025 = score(doc=6681,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.24455236 = fieldWeight in 6681, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0546875 = fieldNorm(doc=6681)
        0.029674664 = product of:
          0.05934933 = sum of:
            0.05934933 = weight(_text_:processing in 6681) [ClassicSimilarity], result of:
              0.05934933 = score(doc=6681,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.3130829 = fieldWeight in 6681, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=6681)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Presents a new approach to text analysis for automating the key phrase indexing process, using artificial intelligence techniques. This mimics the behaviour of human experts by using a rule base consisting of insertion and deletion rules generated by subject-matter experts. The insertion rules are based on the idea that some phrases found in a text imply or trigger other phrases. The deletion rules apply to semantically ambiguous phrases where text presence alone does not determine appropriateness as a key phrase. The insertion and deletion rules are used to transform a list of found phrases to a list of key phrases for indexing a document. Statistical data are provided to demonstrate the performance of this expert rule based system
    Source
    Information processing and management. 27(1991) no.1, S.43-54
  4. Mock, K.J.; Vemuri, V.R.: Information filtering via hill climbing, WordNet, and index patterns (1997) 0.03
    0.032942846 = product of:
      0.06588569 = sum of:
        0.036211025 = weight(_text_:data in 1517) [ClassicSimilarity], result of:
          0.036211025 = score(doc=1517,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.24455236 = fieldWeight in 1517, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1517)
        0.029674664 = product of:
          0.05934933 = sum of:
            0.05934933 = weight(_text_:processing in 1517) [ClassicSimilarity], result of:
              0.05934933 = score(doc=1517,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.3130829 = fieldWeight in 1517, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1517)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    The INFOS (Intelligent News Filtering Organizational System) project is designed to reduce the user's search burden by automatically categorising data as relevant or irrelevant based upon user interests. These predictions are learned automatically based upon features taken from input articles and collaborative features derived from other users. The filtering is performed by a hybrid technique that combines elements of a keyword-based hill climbing method, knowledge-based conceptual representation via WordNet, and partial parsing via index patterns. The hybrid system integrating all these approaches combines the benefits of each while maintaining robustness and scalability
    Source
    Information processing and management. 33(1997) no.5, S.633-644
  5. Schwarz, C.: THESYS: Thesaurus Syntax System : a fully automatic thesaurus building aid (1988) 0.03
    0.032085977 = product of:
      0.12834391 = sum of:
        0.12834391 = sum of:
          0.08393263 = weight(_text_:processing in 1361) [ClassicSimilarity], result of:
            0.08393263 = score(doc=1361,freq=4.0), product of:
              0.18956426 = queryWeight, product of:
                4.048147 = idf(docFreq=2097, maxDocs=44218)
                0.046827413 = queryNorm
              0.4427661 = fieldWeight in 1361, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.048147 = idf(docFreq=2097, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1361)
          0.044411276 = weight(_text_:22 in 1361) [ClassicSimilarity], result of:
            0.044411276 = score(doc=1361,freq=2.0), product of:
              0.16398162 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046827413 = queryNorm
              0.2708308 = fieldWeight in 1361, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1361)
      0.25 = coord(1/4)
    
    Abstract
    THESYS is based on the natural language processing of free-text databases. It yields statistically evaluated correlations between words of the database. These correlations correspond to traditional thesaurus relations. The person who has to build a thesaurus is thus assisted by the proposals made by THESYS. THESYS is being tested on commercial databases under real world conditions. It is part of a text processing project at Siemens, called TINA (Text-Inhalts-Analyse). Software from TINA is actually being applied and evaluated by the US Department of Commerce for patent search and indexing (REALIST: REtrieval Aids by Linguistics and STatistics)
    Date
    6. 1.1999 10:22:07
  6. Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.03
    0.028887425 = product of:
      0.05777485 = sum of:
        0.03657866 = weight(_text_:data in 1853) [ClassicSimilarity], result of:
          0.03657866 = score(doc=1853,freq=4.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.24703519 = fieldWeight in 1853, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 1853) [ClassicSimilarity], result of:
              0.042392377 = score(doc=1853,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 1853, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1853)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics (bibliometrics and scientometrics studies) for STW rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) are already attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has rendered necessary the integration of natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted di-graphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), will seek to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.
  7. Chandrasekar, R.; Bangalore, S.: Glean : using syntactic information in document filtering (2002) 0.03
    0.028887425 = product of:
      0.05777485 = sum of:
        0.03657866 = weight(_text_:data in 4257) [ClassicSimilarity], result of:
          0.03657866 = score(doc=4257,freq=4.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.24703519 = fieldWeight in 4257, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4257)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 4257) [ClassicSimilarity], result of:
              0.042392377 = score(doc=4257,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 4257, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4257)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    In today's networked world, a huge amount of data is available in machine-processable form. Likewise, there are any number of search engines and specialized information retrieval (IR) programs that seek to extract relevant information from these data repositories. Most IR systems and Web search engines have been designed for speed and tend to maximize the quantity of information (recall) rather than the relevance of the information (precision) to the query. As a result, search engine users get inundated with information for practically any query, and are forced to scan a large number of potentially relevant items to get to the information of interest. The Holy Grail of IR is to somehow retrieve those and only those documents pertinent to the user's query. Polysemy and synonymy - the fact that often there are several meanings for a word or phrase, and likewise, many ways to express a concept - make this a very hard task. While conventional IR systems provide usable solutions, there are a number of open problems to be solved, in areas such as syntactic processing, semantic analysis, and user modeling, before we develop systems that "understand" user queries and text collections. Meanwhile, we can use tools and techniques available today to improve the precision of retrieval. In particular, using the approach described in this article, we can approximate understanding using the syntactic structure and patterns of language use that are latent in documents to make IR more effective.
  8. Brychcín, T.; Konopík, M.: HPS: High precision stemmer (2015) 0.03
    0.028887425 = product of:
      0.05777485 = sum of:
        0.03657866 = weight(_text_:data in 2686) [ClassicSimilarity], result of:
          0.03657866 = score(doc=2686,freq=4.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.24703519 = fieldWeight in 2686, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2686)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 2686) [ClassicSimilarity], result of:
              0.042392377 = score(doc=2686,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 2686, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2686)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
    Source
    Information processing and management. 51(2015) no.1, S.68-91
  9. Mustafa el Hadi, W.: Automatic term recognition & extraction tools : examining the new interfaces and their effective communication role in LSP discourse (1998) 0.03
    0.028236724 = product of:
      0.05647345 = sum of:
        0.031038022 = weight(_text_:data in 67) [ClassicSimilarity], result of:
          0.031038022 = score(doc=67,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.2096163 = fieldWeight in 67, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=67)
        0.025435425 = product of:
          0.05087085 = sum of:
            0.05087085 = weight(_text_:processing in 67) [ClassicSimilarity], result of:
              0.05087085 = score(doc=67,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.26835677 = fieldWeight in 67, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046875 = fieldNorm(doc=67)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    In this paper we will discuss the possibility of reorienting NLP (Natural Language Processing) systems towards the extraction, not only of terms and their semantic relations, but also towards a variety of other uses: the storage, accessing and retrieving of Language for Special Purposes (LSP) lexical combinations, the provision of contexts and other information on terms through the integration of more interfaces to terminological data-bases, term managing systems and existing NLP systems. The aim of making such interfaces available is to increase the efficiency of the systems and improve the terminology-oriented text analysis. Since automatic term extraction is the backbone of many applications such as machine translation (MT), indexing, technical writing, thesaurus construction and knowledge representation, developments in this area will have a significant impact
  10. Hess, M.: ¬An incrementally extensible document retrieval system based on linguistic and logical principles (1992) 0.03
    0.028236724 = product of:
      0.05647345 = sum of:
        0.031038022 = weight(_text_:data in 2413) [ClassicSimilarity], result of:
          0.031038022 = score(doc=2413,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.2096163 = fieldWeight in 2413, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=2413)
        0.025435425 = product of:
          0.05087085 = sum of:
            0.05087085 = weight(_text_:processing in 2413) [ClassicSimilarity], result of:
              0.05087085 = score(doc=2413,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.26835677 = fieldWeight in 2413, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2413)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Most natural language based document retrieval systems use the syntax structures of constituent phrases of documents as index terms. Many of these systems also attempt to reduce the syntactic variability of natural language by some normalisation procedure applied to these syntax structures. However, the retrieval performance of such systems remains fairly disappointing. Some systems therefore use a meaning representation language to index and retrieve documents. In this paper, a system is presented that uses Horn Clause Logic as meaning representation language, employs advanced techniques from Natural Language Processing to achieve incremental extensibility, and uses methods from Logic Programming to achieve robustness in the face of insufficient data.
  11. Snajder, J.; Almic, P.: Modeling semantic compositionality of Croatian multiword expressions (2015) 0.03
    0.028236724 = product of:
      0.05647345 = sum of:
        0.031038022 = weight(_text_:data in 2920) [ClassicSimilarity], result of:
          0.031038022 = score(doc=2920,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.2096163 = fieldWeight in 2920, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=2920)
        0.025435425 = product of:
          0.05087085 = sum of:
            0.05087085 = weight(_text_:processing in 2920) [ClassicSimilarity], result of:
              0.05087085 = score(doc=2920,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.26835677 = fieldWeight in 2920, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2920)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    A distinguishing feature of many multiword expressions (MWEs) is their semantic non-compositionality. Determining the semantic compositionality of MWEs is important for many natural language processing tasks. We address the task of modeling semantic compositionality of Croatian MWEs. We adopt a composition-based approach within the distributional semantics framework. We build and evaluate models based on Latent Semantic Analysis and the recently proposed neural network-based Skip-gram model, and experiment with different composition functions. We show that the compositionality scores predicted by the Skip-gram additive models correlate well with human judgments (correlation coefficient of 0.50). When framed as a classification task, the model achieves an accuracy of 0.64.
    Content
    See: http://takelab.fer.hr/data/cromwesc/. The dataset is available from here: TakeLab-CroMWEsc.tar.gz. The archive contains one file, which contains a list of 200 Croatian multiword expressions annotated with semantic compositionality scores. Twenty expressions were annotated by 24 annotators (denoted by "*") and the rest of them were annotated by 6 annotators. Besides median, we provide mode, mean, and standard deviation for each expression. Consult the above mentioned paper for details.
  12. Belbachir, F.; Boughanem, M.: Using language models to improve opinion detection (2018) 0.03
    0.026398288 = product of:
      0.052796576 = sum of:
        0.035839625 = weight(_text_:data in 5044) [ClassicSimilarity], result of:
          0.035839625 = score(doc=5044,freq=6.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.24204408 = fieldWeight in 5044, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.03125 = fieldNorm(doc=5044)
        0.016956951 = product of:
          0.033913903 = sum of:
            0.033913903 = weight(_text_:processing in 5044) [ClassicSimilarity], result of:
              0.033913903 = score(doc=5044,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.17890452 = fieldWeight in 5044, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5044)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Opinion mining is one of the most important research tasks in the information retrieval research community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, which itself focuses on detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information. We hypothesize that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analysis document and reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries. However, in our study, we modify these language models to represent opinionated documents. We carry out several experiments using Text REtrieval Conference (TREC) Blogs 06 as our analysis collection and Internet Movie Data Bases (IMDB), Multi-Perspective Question Answering (MPQA) and CHESLY as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection alongside different combinations of opinion and retrieval scores. We then use this data to deduce the best opinion detection models. Using the best models, our approach improves on the best baseline of TREC Blog (baseline4) by 30%.
    Source
    Information processing and management. 54(2018) no.6, S.958-968
  13. Doszkocs, T.E.; Zamora, A.: Dictionary services and spelling aids for Web searching (2004) 0.03
    0.02620351 = product of:
      0.10481404 = sum of:
        0.10481404 = sum of:
          0.059951875 = weight(_text_:processing in 2541) [ClassicSimilarity], result of:
            0.059951875 = score(doc=2541,freq=4.0), product of:
              0.18956426 = queryWeight, product of:
                4.048147 = idf(docFreq=2097, maxDocs=44218)
                0.046827413 = queryNorm
              0.3162615 = fieldWeight in 2541, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.048147 = idf(docFreq=2097, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2541)
          0.044862162 = weight(_text_:22 in 2541) [ClassicSimilarity], result of:
            0.044862162 = score(doc=2541,freq=4.0), product of:
              0.16398162 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046827413 = queryNorm
              0.27358043 = fieldWeight in 2541, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2541)
      0.25 = coord(1/4)
    
    Abstract
    The Specialized Information Services Division (SIS) of the National Library of Medicine (NLM) provides Web access to more than a dozen scientific databases on toxicology and the environment on TOXNET. Search queries on TOXNET often include misspelled or variant English words, medical and scientific jargon and chemical names. Following the example of search engines like Google and ClinicalTrials.gov, we set out to develop a spelling "suggestion" system for increased recall and precision in TOXNET searching. This paper describes development of dictionary technology that can be used in a variety of applications such as orthographic verification, writing aid, natural language processing, and information storage and retrieval. The design of the technology allows building complex applications using the components developed in the earlier phases of the work in a modular fashion without extensive rewriting of computer code. Since many of the potential applications envisioned for this work have on-line or web-based interfaces, the dictionaries and other computer components must have fast response, and must be adaptable to open-ended database vocabularies, including chemical nomenclature. The dictionary vocabulary for this work was derived from SIS and other databases and specialized resources, such as NLM's Unified Medical Language Systems (UMLS). The resulting technology, A-Z Dictionary (AZdict), has three major constituents: 1) the vocabulary list, 2) the word attributes that define part of speech and morphological relationships between words in the list, and 3) a set of programs that implements the retrieval of words and their attributes, and determines similarity between words (ChemSpell). These three components can be used in various applications such as spelling verification, spelling aid, part-of-speech tagging, paraphrasing, and many other natural language processing functions.
    Date
    14. 8.2004 17:22:56
    Source
    Online. 28(2004) no.3, S.22-29
  14. Godby, J.: WordSmith research project bridges gap between tokens and indexes (1998) 0.03
    0.02594015 = product of:
      0.1037606 = sum of:
        0.1037606 = sum of:
          0.05934933 = weight(_text_:processing in 4729) [ClassicSimilarity], result of:
            0.05934933 = score(doc=4729,freq=2.0), product of:
              0.18956426 = queryWeight, product of:
                4.048147 = idf(docFreq=2097, maxDocs=44218)
                0.046827413 = queryNorm
              0.3130829 = fieldWeight in 4729, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.048147 = idf(docFreq=2097, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4729)
          0.044411276 = weight(_text_:22 in 4729) [ClassicSimilarity], result of:
            0.044411276 = score(doc=4729,freq=2.0), product of:
              0.16398162 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046827413 = queryNorm
              0.2708308 = fieldWeight in 4729, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4729)
      0.25 = coord(1/4)
    
    Abstract
    Reports on an OCLC natural language processing research project to develop methods for identifying terminology in unstructured electronic text, especially material associated with new cultural trends and emerging subjects. Current OCLC production software can only identify single words as indexable terms in full text documents, thus a major goal of the WordSmith project is to develop software that can automatically identify and intelligently organize phrases for use in database indexes. By analyzing user terminology from local newspapers in the USA, the latest cultural trends and technical developments as well as personal and geographic names have been drawn out. Notes that this new vocabulary can also be mapped into reference works.
    Source
    OCLC newsletter. 1998, no.234, Jul/Aug, S.22-24
  15. Rahmstorf, G.: Concept structures for large vocabularies (1998) 0.03
    0.025035713 = product of:
      0.050071426 = sum of:
        0.031038022 = weight(_text_:data in 75) [ClassicSimilarity], result of:
          0.031038022 = score(doc=75,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.2096163 = fieldWeight in 75, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=75)
        0.019033402 = product of:
          0.038066804 = sum of:
            0.038066804 = weight(_text_:22 in 75) [ClassicSimilarity], result of:
              0.038066804 = score(doc=75,freq=2.0), product of:
                0.16398162 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046827413 = queryNorm
                0.23214069 = fieldWeight in 75, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=75)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    A technology is described which supports the acquisition, visualisation and manipulation of large vocabularies with associated structures. It is used for dictionary production, terminology data bases, thesauri, library classification systems etc. Essential features of the technology are a lexicographic user interface, variable word description, unlimited list of word readings, a concept language, automatic transformations of formulas into graphic structures, structure manipulation operations and retransformation into formulas. The concept language includes notations for undefined concepts. The structure of defined concepts can be constructed interactively. The technology supports the generation of large vocabularies with structures representing word senses. Concept structures and ordering systems for indexing and retrieval can be constructed separately and connected by associating relations.
    Date
    30.12.2001 19:01:22
  16. ¬The semantics of relationships : an interdisciplinary perspective (2002) 0.02
    0.023530604 = product of:
      0.04706121 = sum of:
        0.02586502 = weight(_text_:data in 1430) [ClassicSimilarity], result of:
          0.02586502 = score(doc=1430,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.17468026 = fieldWeight in 1430, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1430)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 1430) [ClassicSimilarity], result of:
              0.042392377 = score(doc=1430,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 1430, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1430)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Work on relationships takes place in many communities, including, among others, data modeling, knowledge representation, natural language processing, linguistics, and information retrieval. Unfortunately, continued disciplinary splintering and specialization keeps any one person from being familiar with the full expanse of that work. By including contributions from experts in a variety of disciplines and backgrounds, this volume demonstrates both the parallels that inform work on relationships across a number of fields and the singular emphases that have yet to be fully embraced. The volume is organized into 3 parts: (1) Types of relationships; (2) Relationships in knowledge representation and reasoning; (3) Applications of relationships.
  17. Hmeidi, I.I.; Al-Shalabi, R.F.; Al-Taani, A.T.; Najadat, H.; Al-Hazaimeh, S.A.: ¬A novel approach to the extraction of roots from Arabic words using bigrams (2010) 0.02
    0.023530604 = product of:
      0.04706121 = sum of:
        0.02586502 = weight(_text_:data in 3426) [ClassicSimilarity], result of:
          0.02586502 = score(doc=3426,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.17468026 = fieldWeight in 3426, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3426)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 3426) [ClassicSimilarity], result of:
              0.042392377 = score(doc=3426,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 3426, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3426)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Root extraction is one of the most important topics in information retrieval (IR), natural language processing (NLP), text summarization, and many other important fields. In the last two decades, several algorithms have been proposed to extract Arabic roots. Most of these algorithms dealt with triliteral roots only, and some with fixed length words only. In this study, a novel approach to the extraction of roots from Arabic words using bigrams is proposed. Two similarity measures are used, the dissimilarity measure called the Manhattan distance, and Dice's measure of similarity. The proposed algorithm is tested on the Holy Qu'ran and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. The two files used contain a wide range of data: the Holy Qu'ran contains most of the ancient Arabic words while the other file contains some modern Arabic words and some words borrowed from foreign languages in addition to the original Arabic words. The results of this study showed that combining N-grams with the Dice measure gives better results than using the Manhattan distance measure.
  18. Andrushchenko, M.; Sandberg, K.; Turunen, R.; Marjanen, J.; Hatavara, M.; Kurunmäki, J.; Nummenmaa, T.; Hyvärinen, M.; Teräs, K.; Peltonen, J.; Nummenmaa, J.: Using parsed and annotated corpora to analyze parliamentarians' talk in Finland (2022) 0.02
    0.023530604 = product of:
      0.04706121 = sum of:
        0.02586502 = weight(_text_:data in 471) [ClassicSimilarity], result of:
          0.02586502 = score(doc=471,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.17468026 = fieldWeight in 471, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=471)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 471) [ClassicSimilarity], result of:
              0.042392377 = score(doc=471,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 471, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=471)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    We present a search system for grammatically analyzed corpora of Finnish parliamentary records and interviews with former parliamentarians, annotated with metadata of talk structure and involved parliamentarians, and discuss their use through carefully chosen digital humanities case studies. We first introduce the construction, contents, and principles of use of the corpora. Then we discuss the application of the search system and the corpora to study how politicians talk about power, how ideological terms are used in political speech, and how to identify narratives in the data. All case studies stem from questions in the humanities and the social sciences, but rely on the grammatically parsed corpora in both identifying and quantifying passages of interest. Finally, the paper discusses the role of natural language processing methods for questions in the (digital) humanities. It makes the claim that a digital humanities inquiry of parliamentary speech and interviews with politicians cannot rely only on computational humanities modeling, but needs to accommodate a range of perspectives starting with simple searches, quantitative exploration, and ending with modeling. Furthermore, the digital humanities need a more thorough discussion about how the utilization of tools from information science and technologies alters the research questions posed in the humanities.
  19. Suissa, O.; Elmalech, A.; Zhitomirsky-Geffet, M.: Text analysis using deep neural networks in digital humanities and information science (2022) 0.02
    0.023530604 = product of:
      0.04706121 = sum of:
        0.02586502 = weight(_text_:data in 491) [ClassicSimilarity], result of:
          0.02586502 = score(doc=491,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.17468026 = fieldWeight in 491, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=491)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 491) [ClassicSimilarity], result of:
              0.042392377 = score(doc=491,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 491, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=491)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Combining computational technologies and humanities is an ongoing effort aimed at making resources such as texts, images, audio, video, and other artifacts digitally available, searchable, and analyzable. In recent years, deep neural networks (DNNs) have come to dominate the field of automatic text analysis and natural language processing (NLP), in some cases presenting super-human performance. DNNs are the state-of-the-art machine learning algorithms solving many NLP tasks that are relevant for Digital Humanities (DH) research, such as spell checking, language detection, entity extraction, author detection, question answering, and other tasks. These supervised algorithms learn patterns from a large number of "right" and "wrong" examples and apply them to new examples. However, using DNNs for analyzing the text resources in DH research presents two main challenges: (un)availability of training data and a need for domain adaptation. This paper explores these challenges by analyzing multiple use-cases of DH studies in recent literature and their possible solutions and lays out a practical decision model for DH experts for when and how to choose the appropriate deep learning approaches for their research. Moreover, in this paper, we aim to raise awareness of the benefits of utilizing deep learning models in the DH community.
  20. Laparra, E.; Binford-Walsh, A.; Emerson, K.; Miller, M.L.; López-Hoffman, L.; Currim, F.; Bethard, S.: Addressing structural hurdles for metadata extraction from environmental impact statements (2023) 0.02
    0.023530604 = product of:
      0.04706121 = sum of:
        0.02586502 = weight(_text_:data in 1042) [ClassicSimilarity], result of:
          0.02586502 = score(doc=1042,freq=2.0), product of:
            0.14807065 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046827413 = queryNorm
            0.17468026 = fieldWeight in 1042, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1042)
        0.021196188 = product of:
          0.042392377 = sum of:
            0.042392377 = weight(_text_:processing in 1042) [ClassicSimilarity], result of:
              0.042392377 = score(doc=1042,freq=2.0), product of:
                0.18956426 = queryWeight, product of:
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.046827413 = queryNorm
                0.22363065 = fieldWeight in 1042, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.048147 = idf(docFreq=2097, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1042)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Natural language processing techniques can be used to analyze the linguistic content of a document to extract missing pieces of metadata. However, accurate metadata extraction may not depend solely on the linguistics, but also on structural problems such as extremely large documents, unordered multi-file documents, and inconsistency in manually labeled metadata. In this work, we start from two standard machine learning solutions to extract pieces of metadata from Environmental Impact Statements, environmental policy documents that are regularly produced under the US National Environmental Policy Act of 1969. We present a series of experiments where we evaluate how these standard approaches are affected by different issues derived from real-world data. We find that metadata extraction can be strongly influenced by nonlinguistic factors such as document length and volume ordering and that the standard machine learning solutions often do not scale well to long documents. We demonstrate how such solutions can be better adapted to these scenarios, and conclude with suggestions for other NLP practitioners cataloging large document collections.

Types

  • a 207
  • m 40
  • s 16
  • el 15
  • p 3
  • x 3
  • r 1
