Search (38 results, page 1 of 2)

  • theme_ss:"Automatisches Indexieren"
  • year_i:[2000 TO 2010}
  1. Newman, D.J.; Block, S.: Probabilistic topic decomposition of an eighteenth-century American newspaper (2006) 0.02
    0.017917141 = product of:
      0.044792853 = sum of:
        0.024983391 = weight(_text_:of in 5291) [ClassicSimilarity], result of:
          0.024983391 = score(doc=5291,freq=20.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.38244802 = fieldWeight in 5291, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5291)
        0.019809462 = product of:
          0.039618924 = sum of:
            0.039618924 = weight(_text_:22 in 5291) [ClassicSimilarity], result of:
              0.039618924 = score(doc=5291,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.2708308 = fieldWeight in 5291, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5291)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
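
    Note on reading the score breakdown: the tree above can be checked by hand. The sketch below recomputes the first hit's score using the standard Lucene ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The helper names (idf, term_score) are hypothetical and chosen for illustration; the numeric inputs are copied from the tree for doc 5291. Small differences in the last digits are expected because Lucene computes in single precision.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """ClassicSimilarity inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    """Score of one term clause: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                       # tf(freq)
    term_idf = idf(doc_freq, max_docs)         # idf(docFreq, maxDocs)
    query_weight = term_idf * query_norm       # queryWeight
    field_weight = tf * term_idf * field_norm  # fieldWeight
    return query_weight * field_weight


# Values copied from the explain tree for doc 5291.
MAX_DOCS = 44218
QUERY_NORM = 0.04177434
FIELD_NORM = 0.0546875

# Clause 1: term "of", freq=20, docFreq=25162.
of_score = term_score(20.0, 25162, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# Clause 2: term "22", freq=2, docFreq=3622, inside a nested
# sub-query where 1 of 2 clauses matched, hence coord(1/2).
score_22 = term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM) * (1 / 2)

# Top level: 2 of 5 clauses matched, hence coord(2/5).
total = (of_score + score_22) * (2 / 5)

print(f"of clause: {of_score:.9f}")
print(f"22 clause: {score_22:.9f}")
print(f"total:     {total:.9f}")
```

    The three printed values should agree with the 0.024983391, 0.019809462 and 0.017917141 lines of the tree to about six decimal places. The same recipe applies to every other hit on this page; only freq, fieldNorm and the coord factors change.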
    
    Abstract
    We use a probabilistic mixture decomposition method to determine topics in the Pennsylvania Gazette, a major colonial U.S. newspaper from 1728-1800. We assess the value of several topic decomposition techniques for historical research and compare the accuracy and efficacy of various methods. After determining the topics covered by the 80,000 articles and advertisements in the entire 18th century run of the Gazette, we calculate how the prevalence of those topics changed over time, and give historically relevant examples of our findings. This approach reveals important information about the content of this colonial newspaper, and suggests the value of such approaches to a more complete understanding of early American print culture and society.
    Date
    22. 7.2006 17:32:00
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.6, S.753-767
  2. Ahlgren, P.; Kekäläinen, J.: Indexing strategies for Swedish full text retrieval under different user scenarios (2007) 0.01
    0.013149628 = product of:
      0.03287407 = sum of:
        0.0117592495 = product of:
          0.058796246 = sum of:
            0.058796246 = weight(_text_:problem in 896) [ClassicSimilarity], result of:
              0.058796246 = score(doc=896,freq=4.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.33160037 = fieldWeight in 896, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=896)
          0.2 = coord(1/5)
        0.02111482 = weight(_text_:of in 896) [ClassicSimilarity], result of:
          0.02111482 = score(doc=896,freq=28.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.32322758 = fieldWeight in 896, product of:
              5.2915025 = tf(freq=28.0), with freq of:
                28.0 = termFreq=28.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=896)
      0.4 = coord(2/5)
    
    Abstract
    This paper deals with Swedish full text retrieval and the problem of morphological variation of query terms in the document database. The effects of combination of indexing strategies with query terms on retrieval effectiveness were studied. Three of five tested combinations involved indexing strategies that used conflation, in the form of normalization. Further, two of these three combinations used indexing strategies that employed compound splitting. Normalization and compound splitting were performed by SWETWOL, a morphological analyzer for the Swedish language. A fourth combination attempted to group related terms by right hand truncation of query terms. The four combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database. The five combinations were evaluated under six different user scenarios, where each scenario simulated a certain user type. The four alternative combinations outperformed the baseline, for each user scenario. The truncation combination had the best performance under each user scenario. The main conclusion of the paper is that normalization and right hand truncation (performed by a search expert) enhanced retrieval effectiveness in comparison to the baseline. The performance of the three combinations of indexing strategies with query terms based on normalization was not far below the performance of the truncation combination.
  3. Bloomfield, M.: Indexing : neglected and poorly understood (2001) 0.01
    0.0116526475 = product of:
      0.029131617 = sum of:
        0.009978054 = product of:
          0.04989027 = sum of:
            0.04989027 = weight(_text_:problem in 5439) [ClassicSimilarity], result of:
              0.04989027 = score(doc=5439,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.28137225 = fieldWeight in 5439, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5439)
          0.2 = coord(1/5)
        0.019153563 = weight(_text_:of in 5439) [ClassicSimilarity], result of:
          0.019153563 = score(doc=5439,freq=16.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.2932045 = fieldWeight in 5439, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=5439)
      0.4 = coord(2/5)
    
    Abstract
    The growth of the Internet has highlighted the use of machine indexing. The difficulties in using the Internet as a searching device can be frustrating. The use of the term "Python" is given as an example. Machine indexing is noted as "rotten" and human indexing as "capricious." The problem seems to be a lack of a theoretical foundation for the art of indexing. What librarians have learned over the last hundred years has yet to yield a consistent approach to what really works best in preparing index terms and in the ability of our customers to search the various indexes. An attempt is made to consider the elements of indexing, their pros and cons. The argument is made that machine indexing is far too prolific in its production of index terms. Neither librarians nor computer programmers have made much progress to improve Internet indexing. Human indexing has had the same problems for over fifty years.
  4. Mansour, N.; Haraty, R.A.; Daher, W.; Houri, M.: An auto-indexing method for Arabic text (2008) 0.01
    0.010626211 = product of:
      0.026565526 = sum of:
        0.009978054 = product of:
          0.04989027 = sum of:
            0.04989027 = weight(_text_:problem in 2103) [ClassicSimilarity], result of:
              0.04989027 = score(doc=2103,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.28137225 = fieldWeight in 2103, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2103)
          0.2 = coord(1/5)
        0.016587472 = weight(_text_:of in 2103) [ClassicSimilarity], result of:
          0.016587472 = score(doc=2103,freq=12.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.25392252 = fieldWeight in 2103, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2103)
      0.4 = coord(2/5)
    
    Abstract
    This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. This method is mainly based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the container document. The weight is based on how widely the word is spread across a document, not only on its rate of occurrence. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples. For these examples, we obtained an average recall of 46% and an average precision of 64%.
  5. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.01
    0.010048111 = product of:
      0.025120277 = sum of:
        0.009978054 = product of:
          0.04989027 = sum of:
            0.04989027 = weight(_text_:problem in 2910) [ClassicSimilarity], result of:
              0.04989027 = score(doc=2910,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.28137225 = fieldWeight in 2910, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2910)
          0.2 = coord(1/5)
        0.015142222 = weight(_text_:of in 2910) [ClassicSimilarity], result of:
          0.015142222 = score(doc=2910,freq=10.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.23179851 = fieldWeight in 2910, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2910)
      0.4 = coord(2/5)
    
    Abstract
    Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
  6. Humphrey, S.M.; Névéol, A.; Browne, A.; Gobeil, J.; Ruch, P.; Darmoni, S.J.: Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty (2009) 0.01
    0.009710538 = product of:
      0.024276346 = sum of:
        0.008315044 = product of:
          0.041575223 = sum of:
            0.041575223 = weight(_text_:problem in 3300) [ClassicSimilarity], result of:
              0.041575223 = score(doc=3300,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.23447686 = fieldWeight in 3300, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3300)
          0.2 = coord(1/5)
        0.015961302 = weight(_text_:of in 3300) [ClassicSimilarity], result of:
          0.015961302 = score(doc=3300,freq=16.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.24433708 = fieldWeight in 3300, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3300)
      0.4 = coord(2/5)
    
    Abstract
    Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings (MeSH) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures performance is comparable, and for one measure JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule-based) might be combined and then evaluated showing they are complementary to one another.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.12, S.2530-2539
  7. Hlava, M.M.K.: Automatic indexing : comparing rule-based and statistics-based indexing systems (2005) 0.01
    0.007923785 = product of:
      0.039618924 = sum of:
        0.039618924 = product of:
          0.07923785 = sum of:
            0.07923785 = weight(_text_:22 in 6265) [ClassicSimilarity], result of:
              0.07923785 = score(doc=6265,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.5416616 = fieldWeight in 6265, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6265)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Source
    Information outlook. 9(2005) no.8, S.22-23
  8. Hauer, M.: Automatische Indexierung (2000) 0.01
    0.006791815 = product of:
      0.033959076 = sum of:
        0.033959076 = product of:
          0.06791815 = sum of:
            0.06791815 = weight(_text_:22 in 5887) [ClassicSimilarity], result of:
              0.06791815 = score(doc=5887,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.46428138 = fieldWeight in 5887, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=5887)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Source
    Wissen in Aktion: Wege des Knowledge Managements. 22. Online-Tagung der DGI, Frankfurt am Main, 2.-4.5.2000. Proceedings. Hrsg.: R. Schmidt
  9. Roberts, D.; Souter, C.: The automation of controlled vocabulary subject indexing of medical journal articles (2000) 0.01
    0.005584175 = product of:
      0.027920876 = sum of:
        0.027920876 = weight(_text_:of in 711) [ClassicSimilarity], result of:
          0.027920876 = score(doc=711,freq=34.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.4274153 = fieldWeight in 711, product of:
              5.8309517 = tf(freq=34.0), with freq of:
                34.0 = termFreq=34.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=711)
      0.2 = coord(1/5)
    
    Abstract
    This article discusses the possibility of the automation of sophisticated subject indexing of medical journal articles. Approaches to subject descriptor assignment in information retrieval research are usually either based upon the manual descriptors in the database or generation of search parameters from the text of the article. The principles of the Medline indexing system are described, followed by a summary of a pilot project, based upon the Amed database. The results suggest that a more extended study, based upon Medline, should encompass various components: Extraction of 'concept strings' from titles and abstracts of records, based upon linguistic features characteristic of medical literature. Use of the Unified Medical Language System (UMLS) for identification of controlled vocabulary descriptors. Coordination of descriptors, utilising features of the Medline indexing system. The emphasis should be on system manipulation of data, based upon input, available resources and specifically designed rules.
  10. Witschel, H.F.: Terminology extraction and automatic indexing : comparison and qualitative evaluation of methods (2005) 0.01
    0.0050474075 = product of:
      0.025237037 = sum of:
        0.025237037 = weight(_text_:of in 1842) [ClassicSimilarity], result of:
          0.025237037 = score(doc=1842,freq=40.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.38633084 = fieldWeight in 1842, product of:
              6.3245554 = tf(freq=40.0), with freq of:
                40.0 = termFreq=40.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1842)
      0.2 = coord(1/5)
    
    Abstract
    Many terminology engineering processes involve the task of automatic terminology extraction: before the terminology of a given domain can be modelled, organised or standardised, important concepts (or terms) of this domain have to be identified and fed into terminological databases. These serve in further steps as a starting point for compiling dictionaries, thesauri or maybe even terminological ontologies for the domain. For the extraction of the initial concepts, extraction methods are needed that operate on specialised language texts. On the other hand, many machine learning or information retrieval applications require automatic indexing techniques. In Machine Learning applications concerned with the automatic clustering or classification of texts, often feature vectors are needed that describe the contents of a given text briefly but meaningfully. These feature vectors typically consist of a fairly small set of index terms together with weights indicating their importance. Short but meaningful descriptions of document contents as provided by good index terms are also useful to humans: some knowledge management applications (e.g. topic maps) use them as a set of basic concepts (topics). The author believes that the tasks of terminology extraction and automatic indexing have much in common and can thus benefit from the same set of basic algorithms. It is the goal of this paper to outline some methods that may be used in both contexts, but also to find the discriminating factors between the two tasks that call for the variation of parameters or application of different techniques. The discussion of these methods will be based on statistical, syntactical and especially morphological properties of (index) terms. The paper is concluded by the presentation of some qualitative and quantitative results comparing statistical and morphological methods.
    Source
    TKE 2005: Proc. of Terminology and Knowledge Engineering (TKE) 2005
  11. Pulgarin, A.; Gil-Leiva, I.: Bibliometric analysis of the automatic indexing literature : 1956-2000 (2004) 0.00
    0.004740265 = product of:
      0.023701325 = sum of:
        0.023701325 = weight(_text_:of in 2566) [ClassicSimilarity], result of:
          0.023701325 = score(doc=2566,freq=18.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.36282203 = fieldWeight in 2566, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2566)
      0.2 = coord(1/5)
    
    Abstract
    We present a bibliometric study of a corpus of 839 bibliographic references about automatic indexing, covering the period 1956-2000. We analyse the distribution of authors and works, the obsolescence and its dispersion, and the distribution of the literature by topic, year, and source type. We conclude that: (i) there has been a constant interest on the part of researchers; (ii) the most studied topics were the techniques and methods employed and the general aspects of automatic indexing; (iii) the productivity of the authors does fit a Lotka distribution (Dmax=0.02 and critical value=0.054); (iv) the annual aging factor is 95%; and (v) the dispersion of the literature is low.
  12. Lepsky, K.; Vorhauer, J.: Lingo - ein open source System für die Automatische Indexierung deutschsprachiger Dokumente (2006) 0.00
    0.0045278776 = product of:
      0.022639386 = sum of:
        0.022639386 = product of:
          0.045278773 = sum of:
            0.045278773 = weight(_text_:22 in 3581) [ClassicSimilarity], result of:
              0.045278773 = score(doc=3581,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.30952093 = fieldWeight in 3581, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=3581)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Date
    24. 3.2006 12:22:02
  13. Probst, M.; Mittelbach, J.: Maschinelle Indexierung in der Sacherschließung wissenschaftlicher Bibliotheken (2006) 0.00
    0.0045278776 = product of:
      0.022639386 = sum of:
        0.022639386 = product of:
          0.045278773 = sum of:
            0.045278773 = weight(_text_:22 in 1755) [ClassicSimilarity], result of:
              0.045278773 = score(doc=1755,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.30952093 = fieldWeight in 1755, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1755)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Date
    22. 3.2008 12:35:19
  14. Moens, M.F.: Automatic indexing and abstracting of document texts (2000) 0.00
    0.0045145387 = product of:
      0.022572692 = sum of:
        0.022572692 = weight(_text_:of in 6892) [ClassicSimilarity], result of:
          0.022572692 = score(doc=6892,freq=8.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.34554482 = fieldWeight in 6892, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.078125 = fieldNorm(doc=6892)
      0.2 = coord(1/5)
    
    Content
    Need for indexing and abstracting texts; attributes of texts; text representations and their use; selection of natural language index terms; assignment of controlled language index texts; automatic abstracting; applications
  15. Pirkola, A.: Morphological typology of languages for IR (2001) 0.00
    0.0044919094 = product of:
      0.022459546 = sum of:
        0.022459546 = weight(_text_:of in 4476) [ClassicSimilarity], result of:
          0.022459546 = score(doc=4476,freq=22.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.34381276 = fieldWeight in 4476, product of:
              4.690416 = tf(freq=22.0), with freq of:
                22.0 = termFreq=22.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=4476)
      0.2 = coord(1/5)
    
    Abstract
    This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of every language in the world can be described by two variables, index of synthesis and index of fusion. These variables provide a theoretical basis for IR research handling morphological issues. A common theoretical framework is needed in particular because of the increasing significance of cross-language retrieval research and CLIR systems processing different languages. The paper elaborates the linguistic morphological typology for the purposes of IR research. It studies how the indexes of synthesis and fusion could be used as practical tools in mono- and cross-lingual IR research. The need for semantic and syntactic typologies is discussed. The paper also reviews studies made in different languages on the effects of morphology and stemming in IR.
    Source
    Journal of documentation. 57(2001) no.3, S.330-348
  16. Galvez, C.; Moya-Anegón, F. de: An evaluation of conflation accuracy using finite-state transducers (2006) 0.00
    0.0044919094 = product of:
      0.022459546 = sum of:
        0.022459546 = weight(_text_:of in 5599) [ClassicSimilarity], result of:
          0.022459546 = score(doc=5599,freq=22.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.34381276 = fieldWeight in 5599, product of:
              4.690416 = tf(freq=22.0), with freq of:
                22.0 = termFreq=22.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=5599)
      0.2 = coord(1/5)
    
    Abstract
    Purpose - To evaluate the accuracy of conflation methods based on finite-state transducers (FSTs). Design/methodology/approach - Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The process of normalization we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm. Findings - The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms. Originality/value - The report outlines the potential of transducers in their application to normalization processes.
    Source
    Journal of documentation. 62(2006) no.3, S.328-349
  17. Souza, R.R.; Raghavan, K.S.: A methodology for noun phrase-based automatic indexing (2006) 0.00
    0.0044919094 = product of:
      0.022459546 = sum of:
        0.022459546 = weight(_text_:of in 173) [ClassicSimilarity], result of:
          0.022459546 = score(doc=173,freq=22.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.34381276 = fieldWeight in 173, product of:
              4.690416 = tf(freq=22.0), with freq of:
                22.0 = termFreq=22.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=173)
      0.2 = coord(1/5)
    
    Abstract
    The scholarly community is increasingly employing the Web both for publication of scholarly output and for locating and accessing relevant scholarly literature. Organization of this vast body of digital information assumes significance in this context. The sheer volume of digital information to be handled makes traditional indexing and knowledge representation strategies ineffective and impractical. It is, therefore, worth exploring new approaches. An approach being discussed considers the intrinsic semantics of texts of documents. Based on the hypothesis that noun phrases in a text are semantically rich in terms of their ability to represent the subject content of the document, this approach seeks to identify and extract noun phrases instead of single keywords, and use them as descriptors. This paper presents a methodology that has been developed for extracting noun phrases from Portuguese texts. The results of an experiment carried out to test the adequacy of the methodology are also presented.
  18. Hlava, M.M.: Automatic indexing : a matter of degree (2002) 0.00
    0.004469165 = product of:
      0.022345824 = sum of:
        0.022345824 = weight(_text_:of in 2501) [ClassicSimilarity], result of:
          0.022345824 = score(doc=2501,freq=4.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.34207192 = fieldWeight in 2501, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.109375 = fieldNorm(doc=2501)
      0.2 = coord(1/5)
    
    Source
    Bulletin of the American Society for Information Science. 28(2002) no.1, S.12-15
  19. Chung, Y.M.; Lee, J.Y.: A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.00
    0.004371183 = product of:
      0.021855915 = sum of:
        0.021855915 = weight(_text_:of in 5769) [ClassicSimilarity], result of:
          0.021855915 = score(doc=5769,freq=30.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.33457235 = fieldWeight in 5769, product of:
              5.477226 = tf(freq=30.0), with freq of:
                30.0 = termFreq=30.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.2 = coord(1/5)
    
    Abstract
    Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of a term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as χ² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
    Source
    Journal of the American Society for Information Science and Technology. 52(2001) no.4, S.283-296
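
    For orientation, five of the measures named in the abstract above can be computed from a 2x2 co-occurrence table. This is only a sketch using common textbook definitions, which may differ from the exact formulations in the paper; the function name, the cell names a, b, c, d, and the sample counts are hypothetical.

```python
import math


def association_measures(a, b, c, d):
    """Term association measures from a 2x2 document co-occurrence table.

    a = documents containing both terms, b = only term x,
    c = only term y, d = neither term. Assumes all marginals are
    non-zero and a > 0 (otherwise mutual information is undefined).
    """
    n = a + b + c + d
    fx, fy = a + b, a + c                      # document frequencies of x and y
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return {
        "cosine": a / math.sqrt(fx * fy),
        "jaccard": a / (a + b + c),
        # Pointwise mutual information in bits (textbook definition).
        "mutual_information": math.log2(a * n / (fx * fy)),
        # Yule's coefficient of colligation Y.
        "yule_y": (root_ad - root_bc) / (root_ad + root_bc),
        # Pearson chi-square statistic for a 2x2 table.
        "chi_square": n * (a * d - b * c) ** 2 / (fx * fy * (b + d) * (c + d)),
    }


# Hypothetical counts: 100 documents, each term in 20, both terms in 10.
measures = association_measures(a=10, b=10, c=10, d=70)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

    The likelihood ratio is omitted to keep the sketch short.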
  20. Medelyan, O.; Witten, I.H.: Domain-independent automatic keyphrase indexing with small training sets (2008) 0.00
    0.0040630843 = product of:
      0.02031542 = sum of:
        0.02031542 = weight(_text_:of in 1871) [ClassicSimilarity], result of:
          0.02031542 = score(doc=1871,freq=18.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.3109903 = fieldWeight in 1871, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=1871)
      0.2 = coord(1/5)
    
    Abstract
    Keyphrases are widely used in both physical and digital libraries as a brief, but precise, summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloging rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and of French and Spanish documents.
    Source
    Journal of the American Society for Information Science and Technology. 59(2008) no.7, S.1026-1040