Search (102 results, page 2 of 6)

  • × theme_ss:"Computerlinguistik"
  1. Baayen, R.H.; Lieber, H.: Word frequency distributions and lexical semantics (1997) 0.02
    0.015707046 = product of:
      0.047121134 = sum of:
        0.047121134 = product of:
          0.09424227 = sum of:
            0.09424227 = weight(_text_:22 in 3117) [ClassicSimilarity], result of:
              0.09424227 = score(doc=3117,freq=2.0), product of:
                0.17398734 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.049684696 = queryNorm
                0.5416616 = fieldWeight in 3117, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3117)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    28. 2.1999 10:48:22
  2. ¬Der Student aus dem Computer (2023) 0.02
    0.015707046 = product of:
      0.047121134 = sum of:
        0.047121134 = product of:
          0.09424227 = sum of:
            0.09424227 = weight(_text_:22 in 1079) [ClassicSimilarity], result of:
              0.09424227 = score(doc=1079,freq=2.0), product of:
                0.17398734 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.049684696 = queryNorm
                0.5416616 = fieldWeight in 1079, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=1079)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    27. 1.2023 16:22:55
  3. Fox, C.: Lexical analysis and stoplists (1992) 0.02
    0.015166845 = product of:
      0.045500536 = sum of:
        0.045500536 = product of:
          0.09100107 = sum of:
            0.09100107 = weight(_text_:indexing in 3502) [ClassicSimilarity], result of:
              0.09100107 = score(doc=3502,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47848347 = fieldWeight in 3502, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=3502)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Lexical analysis is a fundamental operation in both query processing and automatic indexing, and filtering stoplist words is an important step in the automatic indexing process. Presents basic algorithms and data structures for lexical analysis, and shows how stoplist word removal can be efficiently incorporated into lexical analysis
  4. Frakes, W.B.: Stemming algorithms (1992) 0.02
    0.015166845 = product of:
      0.045500536 = sum of:
        0.045500536 = product of:
          0.09100107 = sum of:
            0.09100107 = weight(_text_:indexing in 3503) [ClassicSimilarity], result of:
              0.09100107 = score(doc=3503,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47848347 = fieldWeight in 3503, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=3503)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Desribes stemming algorithms - programs that relate morphologically similar indexing and search terms. Stemming is used to improve retrieval effectiveness and to reduce the size of indexing files. Several approaches to stemming are describes - table lookup, affix removal, successor variety, and n-gram. empirical studies of stemming are summarized. The Porter stemmer is described in detail, and a full implementation in C is presented
  5. Sharada, B.A.: Rules derivation for Kannada based indexing language using transformational grammar (1998) 0.02
    0.015166845 = product of:
      0.045500536 = sum of:
        0.045500536 = product of:
          0.09100107 = sum of:
            0.09100107 = weight(_text_:indexing in 3533) [ClassicSimilarity], result of:
              0.09100107 = score(doc=3533,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47848347 = fieldWeight in 3533, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=3533)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Discusses the importance of syntax in analysing document titles. In an natural language processing environment, suggests a rule for analysing the indexing language based on Kannada, one of the major Indian languages, using the principles of transformational grammar
  6. Mustafa el Hadi, W.: ¬The contribution of terminology to the theoretical conception of classificatory languages and document indexing (1990) 0.02
    0.015166845 = product of:
      0.045500536 = sum of:
        0.045500536 = product of:
          0.09100107 = sum of:
            0.09100107 = weight(_text_:indexing in 5273) [ClassicSimilarity], result of:
              0.09100107 = score(doc=5273,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47848347 = fieldWeight in 5273, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5273)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Demonstrates the contribution of indexing languages to the analysis of certain linguistic phenomena, and reciprocally, the contribution of linguistics to the analysis of semantic relationships used by documentalists in conceiving thesauri
  7. Ahlgren, P.; Kekäläinen, J.: Indexing strategies for Swedish full text retrieval under different user scenarios (2007) 0.01
    0.014988055 = product of:
      0.044964164 = sum of:
        0.044964164 = product of:
          0.08992833 = sum of:
            0.08992833 = weight(_text_:indexing in 896) [ClassicSimilarity], result of:
              0.08992833 = score(doc=896,freq=10.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47284302 = fieldWeight in 896, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=896)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    This paper deals with Swedish full text retrieval and the problem of morphological variation of query terms in the document database. The effects of combination of indexing strategies with query terms on retrieval effectiveness were studied. Three of five tested combinations involved indexing strategies that used conflation, in the form of normalization. Further, two of these three combinations used indexing strategies that employed compound splitting. Normalization and compound splitting were performed by SWETWOL, a morphological analyzer for the Swedish language. A fourth combination attempted to group related terms by right hand truncation of query terms. The four combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database. The five combinations were evaluated under six different user scenarios, where each scenario simulated a certain user type. The four alternative combinations outperformed the baseline, for each user scenario. The truncation combination had the best performance under each user scenario. The main conclusion of the paper is that normalization and right hand truncation (performed by a search expert) enhanced retrieval effectiveness in comparison to the baseline. The performance of the three combinations of indexing strategies with query terms based on normalization was not far below the performance of the truncation combination.
  8. Fagan, J.L.: ¬The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval (1989) 0.01
    0.014988055 = product of:
      0.044964164 = sum of:
        0.044964164 = product of:
          0.08992833 = sum of:
            0.08992833 = weight(_text_:indexing in 1845) [ClassicSimilarity], result of:
              0.08992833 = score(doc=1845,freq=10.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47284302 = fieldWeight in 1845, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1845)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    It may be possible to improve the quality of automatic indexing systems by using complex descriptors, for example, phrases, in addition to the simple descriptors (words or word stems) that are normally used in automatically constructed representations of document content. This study is directed toward the goal of developing effective methods of identifying phrases in natural language text from which good quality phrase descriptors can be constructed. The effectiveness of one method, a simple nonsyntactic phrase indexing procedure, has been tested on five experimental document collections. The results have been analyzed in order to identify the inadequacies of the procedure, and to determine what kinds of information about text structure are needed in order to construct phrase descriptors that are good indicators of document content. Two primary conclusions have been reached: (1) In the retrieval experiments, the nonsyntactic phrase construction procedure did not consistently yield substantial improvements in effectiveness. It is therefore not likely that phrase indexing of this kind will prove to be an important method of enhancing the performance of automatic document indexing and retrieval systems in operational environments. (2) Many of the shortcomings of the nonsyntactic approach can be overcome by incorporating syntactic information into the phrase construction process. However, a general syntactic analysis facility may be required, since many useful sources of phrases cannot be exploited if only a limited inventory of syntactic patterns can be recognized. Further research should be conducted into methods of incorporating automatic syntactic analysis into content analysis for document retrieval.
  9. Ali, C.B.; Haddad, H.; Slimani, Y.: Multi-word terms selection for information retrieval (2022) 0.01
    0.014988055 = product of:
      0.044964164 = sum of:
        0.044964164 = product of:
          0.08992833 = sum of:
            0.08992833 = weight(_text_:indexing in 900) [ClassicSimilarity], result of:
              0.08992833 = score(doc=900,freq=10.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.47284302 = fieldWeight in 900, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=900)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Purpose A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches suffer from precision inefficiency at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors dive beyond simple terms indexing to propose a framework for multi-word terms (MWT) filtering and indexing. Design/methodology/approach In this paper, the authors rely on ranking MWT to filter them, keeping the most effective ones for the indexing process. The proposed model is based on filtering MWT according to their ability to capture the document topic and distinguish between different documents from the same collection. The authors rely on the hypothesis that the best MWT are those that achieve the greatest association degree. The experiments are carried out with English and French languages data sets. Findings The results indicate that this approach achieved precision enhancements at low recall, and it performed better than more advanced models based on terms dependencies. Originality/value Using and testing different association measures to select MWT that best describe the documents to enhance the precision in the first retrieved documents.
  10. Byrne, C.C.; McCracken, S.A.: ¬An adaptive thesaurus employing semantic distance, relational inheritance and nominal compound interpretation for linguistic support of information retrieval (1999) 0.01
    0.0134631805 = product of:
      0.04038954 = sum of:
        0.04038954 = product of:
          0.08077908 = sum of:
            0.08077908 = weight(_text_:22 in 4483) [ClassicSimilarity], result of:
              0.08077908 = score(doc=4483,freq=2.0), product of:
                0.17398734 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.049684696 = queryNorm
                0.46428138 = fieldWeight in 4483, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=4483)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    15. 3.2000 10:22:37
  11. Boleda, G.; Evert, S.: Multiword expressions : a pain in the neck of lexical semantics (2009) 0.01
    0.0134631805 = product of:
      0.04038954 = sum of:
        0.04038954 = product of:
          0.08077908 = sum of:
            0.08077908 = weight(_text_:22 in 4888) [ClassicSimilarity], result of:
              0.08077908 = score(doc=4888,freq=2.0), product of:
                0.17398734 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.049684696 = queryNorm
                0.46428138 = fieldWeight in 4888, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=4888)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Date
    1. 3.2013 14:56:22
  12. Monnerjahn, P.: Vorsprung ohne Technik : Übersetzen: Computer und Qualität (2000) 0.01
    0.0134631805 = product of:
      0.04038954 = sum of:
        0.04038954 = product of:
          0.08077908 = sum of:
            0.08077908 = weight(_text_:22 in 5429) [ClassicSimilarity], result of:
              0.08077908 = score(doc=5429,freq=2.0), product of:
                0.17398734 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.049684696 = queryNorm
                0.46428138 = fieldWeight in 5429, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=5429)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Source
    c't. 2000, H.22, S.230-231
  13. Sabourin, C.F. (Bearb.): Computational linguistics in information science : bibliography (1994) 0.01
    0.0134057235 = product of:
      0.04021717 = sum of:
        0.04021717 = product of:
          0.08043434 = sum of:
            0.08043434 = weight(_text_:indexing in 8280) [ClassicSimilarity], result of:
              0.08043434 = score(doc=8280,freq=2.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.42292362 = fieldWeight in 8280, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.078125 = fieldNorm(doc=8280)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    The bibliography covers information retrieval (2100 refs.), fulltext (890) or conceptual (60), automatic indexing (930), information extraction (520), query languages (1090), etc.; altogether 6390 references, fully indexed
  14. Zimmermann, H.H.: Language and language technology (1991) 0.01
    0.0134057235 = product of:
      0.04021717 = sum of:
        0.04021717 = product of:
          0.08043434 = sum of:
            0.08043434 = weight(_text_:indexing in 2568) [ClassicSimilarity], result of:
              0.08043434 = score(doc=2568,freq=2.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.42292362 = fieldWeight in 2568, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.078125 = fieldNorm(doc=2568)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Considers aspects of language and linguistic studies that directly affect information handling including: electronic word processing (hyphenation, spelling correction, dictionary-based synonym provision); man-machine communication; machine understanding of spoken language; automatic indexing; and machine translation
  15. Driscoll, J.R.; Rajala, D.A.; Shaffer, W.H.: ¬The operation and performance of an artificially intelligent keywording system (1991) 0.01
    0.013270989 = product of:
      0.039812967 = sum of:
        0.039812967 = product of:
          0.079625934 = sum of:
            0.079625934 = weight(_text_:indexing in 6681) [ClassicSimilarity], result of:
              0.079625934 = score(doc=6681,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.41867304 = fieldWeight in 6681, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=6681)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Presents a new approach to text analysis for automating the key phrase indexing process, using artificial intelligence techniques. This mimics the behaviour of human experts by using a rule base consisting of insertion and deletion rules generated by subject-matter experts. The insertion rules are based on the idea that some phrases found in a text imply or trigger other phrases. The deletion rules apply to semantically ambiguous phrases where text presence alone does not determine appropriateness as a key phrase. The insertion and deletion rules are used to transform a list of found phrases to a list of key phrases for indexing a document. Statistical data are provided to demonstrate the performance of this expert rule based system
  16. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.01
    0.013270989 = product of:
      0.039812967 = sum of:
        0.039812967 = product of:
          0.079625934 = sum of:
            0.079625934 = weight(_text_:indexing in 1595) [ClassicSimilarity], result of:
              0.079625934 = score(doc=1595,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.41867304 = fieldWeight in 1595, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1595)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    This paper presents a method that exploits the hierarchical structure of an indexing vocabulary to guide the development and training of machine learning methods for automatic text categorization. We present the design of a hierarchical classifier based an the divide-and-conquer principle. The method is evaluated using backpropagation neural networks, such as the machine learning algorithm, that leam to assign MeSH categories to a subset of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers, are provided. The results indicate that the use of hierarchical structures improves Performance significantly.
  17. Kuhlen, R.: Experimentelle Morphologie in der Informationswissenschaft (1977) 0.01
    0.013270989 = product of:
      0.039812967 = sum of:
        0.039812967 = product of:
          0.079625934 = sum of:
            0.079625934 = weight(_text_:indexing in 4253) [ClassicSimilarity], result of:
              0.079625934 = score(doc=4253,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.41867304 = fieldWeight in 4253, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4253)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    LCSH
    Automatic indexing
    Subject
    Automatic indexing
  18. Vlachidis, A.; Binding, C.; Tudhope, D.; May, K.: Excavating grey literature : a case study on the rich indexing of archaeological documents via natural language-processing techniques and knowledge-based resources (2010) 0.01
    0.013134873 = product of:
      0.039404616 = sum of:
        0.039404616 = product of:
          0.07880923 = sum of:
            0.07880923 = weight(_text_:indexing in 3948) [ClassicSimilarity], result of:
              0.07880923 = score(doc=3948,freq=12.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.41437882 = fieldWeight in 3948, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3948)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Purpose - This paper sets out to discuss the use of information extraction (IE), a natural language-processing (NLP) technique to assist "rich" semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic-aware "rich" indexing of diverse natural language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project. Design/methodology/approach - The paper proposes use of the English Heritage extension (CRM-EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology-Oriented Information Extraction process. The process of semantic indexing is based on a rule-based Information Extraction technique, which is facilitated by the General Architecture of Text Engineering (GATE) toolkit and expressed by Java Annotation Pattern Engine (JAPE) rules. Findings - Initial results suggest that the combination of information extraction with knowledge resources and standard conceptual models is capable of supporting semantic-aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms. Originality/value - The value of the paper lies in the semantic indexing of 535 unpublished online documents often referred to as "Grey Literature", from the Archaeological Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and P19.Physical Object.
  19. Witschel, H.F.: Terminology extraction and automatic indexing : comparison and qualitative evaluation of methods (2005) 0.01
    0.011609698 = product of:
      0.03482909 = sum of:
        0.03482909 = product of:
          0.06965818 = sum of:
            0.06965818 = weight(_text_:indexing in 1842) [ClassicSimilarity], result of:
              0.06965818 = score(doc=1842,freq=6.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.3662626 = fieldWeight in 1842, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1842)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Many terminology engineering processes involve the task of automatic terminology extraction: before the terminology of a given domain can be modelled, organised or standardised, important concepts (or terms) of this domain have to be identified and fed into terminological databases. These serve in further steps as a starting point for compiling dictionaries, thesauri or maybe even terminological ontologies for the domain. For the extraction of the initial concepts, extraction methods are needed that operate on specialised language texts. On the other hand, many machine learning or information retrieval applications require automatic indexing techniques. In Machine Learning applications concerned with the automatic clustering or classification of texts, often feature vectors are needed that describe the contents of a given text briefly but meaningfully. These feature vectors typically consist of a fairly small set of index terms together with weights indicating their importance. Short but meaningful descriptions of document contents as provided by good index terms are also useful to humans: some knowledge management applications (e.g. topic maps) use them as a set of basic concepts (topics). The author believes that the tasks of terminology extraction and automatic indexing have much in common and can thus benefit from the same set of basic algorithms. It is the goal of this paper to outline some methods that may be used in both contexts, but also to find the discriminating factors between the two tasks that call for the variation of parameters or application of different techniques. The discussion of these methods will be based on statistical, syntactical and especially morphological properties of (index) terms. The paper is concluded by the presentation of some qualitative and quantitative results comparing statistical and morphological methods.
  20. Anguiano Peña, G.; Naumis Peña, C.: Method for selecting specialized terms from a general language corpus (2015) 0.01
    0.011375135 = product of:
      0.034125403 = sum of:
        0.034125403 = product of:
          0.068250805 = sum of:
            0.068250805 = weight(_text_:indexing in 2196) [ClassicSimilarity], result of:
              0.068250805 = score(doc=2196,freq=4.0), product of:
                0.19018644 = queryWeight, product of:
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.049684696 = queryNorm
                0.3588626 = fieldWeight in 2196, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.8278677 = idf(docFreq=2614, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2196)
          0.5 = coord(1/2)
      0.33333334 = coord(1/3)
    
    Abstract
    Among the many aspects studied by library and information science are linguistic phenomena associated with document content analysis, for purposes of both information organization and retrieval. To this end, terms used in scientific and technical language must be recovered and their area of domain and behavior studied. Through language, society controls the knowledge available to people. Document content analysis, in this case of scientific texts, facilitates gathering knowledge of lexical units and their major applications and separating such specialized terms from the general language, to create indexing languages. The model presented here or other lexicographic resources with similar characteristics may be useful in the near future, in computer-assisted indexing or as corpora monitors, with respect to new text analyses or specialized corpora. Thus, using techniques for document content analysis of a lexicographically labeled general language corpus proposed herein, components which enable the extraction of lexical units from specialized language may be obtained and characterized.

Languages

  • e 78
  • d 21
  • da 1
  • f 1
  • m 1
  • More… Less…

Types

  • a 82
  • m 9
  • el 8
  • s 4
  • p 3
  • x 2
  • b 1
  • d 1
  • More… Less…

Classifications