Search (24 results, page 1 of 2)

  • × language_ss:"e"
  • × theme_ss:"Automatisches Indexieren"
  • × year_i:[2000 TO 2010}
  1. Hlava, M.M.K.: Automatic indexing : comparing rule-based and statistics-based indexing systems (2005) 0.02
    0.02089367 = product of:
      0.06268101 = sum of:
        0.020966241 = weight(_text_:information in 6265) [ClassicSimilarity], result of:
          0.020966241 = score(doc=6265,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.27153665 = fieldWeight in 6265, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.109375 = fieldNorm(doc=6265)
        0.04171477 = product of:
          0.08342954 = sum of:
            0.08342954 = weight(_text_:22 in 6265) [ClassicSimilarity], result of:
              0.08342954 = score(doc=6265,freq=2.0), product of:
                0.1540252 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.043984205 = queryNorm
                0.5416616 = fieldWeight in 6265, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6265)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Source
    Information outlook. 9(2005) no.8, S.22-23
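     The score breakdown shown above (and in the entries that follow) is Lucene/Solr "explain" output for the classic TF-IDF similarity. A minimal sketch, assuming the standard ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs/(docFreq+1)), queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, coord = matching clauses / query clauses), reproduces the numbers shown for this record:

        import math

        # Values copied from the explain output for doc 6265 above.
        freq       = 2.0          # termFreq of "information"
        idf        = 1.7554779    # = 1 + ln(44218 / (20772 + 1))
        query_norm = 0.043984205  # queryNorm
        field_norm = 0.109375     # fieldNorm(doc=6265)

        tf = math.sqrt(freq)                        # 1.4142135
        query_weight = idf * query_norm             # 0.0772133
        field_weight = tf * idf * field_norm        # 0.27153665
        info_clause  = query_weight * field_weight  # 0.020966241

        # Add the "_text_:22" clause and apply coord(2/6) to get the listed score.
        score = (info_clause + 0.04171477) * (2 / 6)
        print(round(score, 8))                      # ~0.02089367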
  2. Rasmussen, E.M.: Indexing and retrieval for the Web (2002) 0.02
    0.019450821 = product of:
      0.058352463 = sum of:
        0.020300474 = weight(_text_:information in 4285) [ClassicSimilarity], result of:
          0.020300474 = score(doc=4285,freq=30.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.2629142 = fieldWeight in 4285, product of:
              5.477226 = tf(freq=30.0), with freq of:
                30.0 = termFreq=30.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.02734375 = fieldNorm(doc=4285)
        0.03805199 = weight(_text_:networks in 4285) [ClassicSimilarity], result of:
          0.03805199 = score(doc=4285,freq=2.0), product of:
            0.20804176 = queryWeight, product of:
              4.72992 = idf(docFreq=1060, maxDocs=44218)
              0.043984205 = queryNorm
            0.18290554 = fieldWeight in 4285, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.72992 = idf(docFreq=1060, maxDocs=44218)
              0.02734375 = fieldNorm(doc=4285)
      0.33333334 = coord(2/6)
    
    Abstract
     The introduction and growth of the World Wide Web (WWW, or Web) have resulted in a profound change in the way individuals and organizations access information. In terms of volume, nature, and accessibility, the characteristics of electronic information are significantly different from those of even five or six years ago. Control of, and access to, this flood of information rely heavily on automated techniques for indexing and retrieval. According to Gudivada, Raghavan, Grosky, and Kasanagottu (1997, p. 58), "The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential." Almost 93 percent of those surveyed consider the Web an "indispensable" Internet technology, second only to e-mail (Graphics, Visualization & Usability Center, 1998). Although there are other ways of locating information on the Web (browsing or following directory structures), 85 percent of users identify Web pages by means of a search engine (Graphics, Visualization & Usability Center, 1998). A more recent study conducted by the Stanford Institute for the Quantitative Study of Society confirms the finding that searching for information is second only to e-mail as an Internet activity (Nie & Erbring, 2000, online). In fact, Nie and Erbring conclude, "... the Internet today is a giant public library with a decidedly commercial tilt. The most widespread use of the Internet today is as an information search utility for products, travel, hobbies, and general information. Virtually all users interviewed responded that they engaged in one or more of these information gathering activities."
     Techniques for automated indexing and information retrieval (IR) have been developed, tested, and refined over the past 40 years, and are well documented (see, for example, Agosti & Smeaton, 1996; Baeza-Yates & Ribeiro-Neto, 1999a; Frakes & Baeza-Yates, 1992; Korfhage, 1997; Salton, 1989; Witten, Moffat, & Bell, 1999). With the introduction of the Web, and the capability to index and retrieve via search engines, these techniques have been extended to a new environment. They have been adopted, altered, and in some cases extended to include new methods. "In short, search engines are indispensable for searching the Web, they employ a variety of relatively advanced IR techniques, and there are some peculiar aspects of search engines that make searching the Web different than more conventional information retrieval" (Gordon & Pathak, 1999, p. 145). The environment for information retrieval on the World Wide Web differs from that of "conventional" information retrieval in a number of fundamental ways. The collection is very large and changes continuously, with pages being added, deleted, and altered. Wide variability between the size, structure, focus, quality, and usefulness of documents makes Web documents much more heterogeneous than a typical electronic document collection. The wide variety of document types includes images, video, audio, and scripts, as well as many different document languages. Duplication of documents and sites is common. Documents are interconnected through networks of hyperlinks. Because of the size and dynamic nature of the Web, preprocessing all documents requires considerable resources and is often not feasible, certainly not on the frequent basis required to ensure currency. Query length is usually much shorter than in other environments (only a few words), and user behavior differs from that in other environments. These differences make the Web a novel environment for information retrieval (Baeza-Yates & Ribeiro-Neto, 1999b; Bharat & Henzinger, 1998; Huang, 2000).
    Source
    Annual review of information science and technology. 37(2003), S.91-126
  3. Newman, D.J.; Block, S.: Probabilistic topic decomposition of an eighteenth-century American newspaper (2006) 0.01
    0.011894252 = product of:
      0.035682756 = sum of:
        0.014825371 = weight(_text_:information in 5291) [ClassicSimilarity], result of:
          0.014825371 = score(doc=5291,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.1920054 = fieldWeight in 5291, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5291)
        0.020857384 = product of:
          0.04171477 = sum of:
            0.04171477 = weight(_text_:22 in 5291) [ClassicSimilarity], result of:
              0.04171477 = score(doc=5291,freq=2.0), product of:
                0.1540252 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.043984205 = queryNorm
                0.2708308 = fieldWeight in 5291, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5291)
          0.5 = coord(1/2)
      0.33333334 = coord(2/6)
    
    Abstract
    We use a probabilistic mixture decomposition method to determine topics in the Pennsylvania Gazette, a major colonial U.S. newspaper from 1728-1800. We assess the value of several topic decomposition techniques for historical research and compare the accuracy and efficacy of various methods. After determining the topics covered by the 80,000 articles and advertisements in the entire 18th century run of the Gazette, we calculate how the prevalence of those topics changed over time, and give historically relevant examples of our findings. This approach reveals important information about the content of this colonial newspaper, and suggests the value of such approaches to a more complete understanding of early American print culture and society.
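     The abstract does not name the exact mixture model used; a minimal sketch of one common probabilistic topic decomposition (latent Dirichlet allocation via scikit-learn, with toy documents standing in for the Gazette articles) might look like:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        # Toy stand-in for the Gazette articles; real input would be the 80,000 texts.
        docs = [
            "ship cargo arrived port merchant goods sale",
            "runaway servant reward subscriber apprehended",
            "assembly act province governor council law",
            "sale land acres plantation dwelling house",
        ]

        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(docs)

        lda = LatentDirichletAllocation(n_components=3, random_state=0)
        lda.fit(X)

        # Print the top words per topic; tracking topic weight per year would give
        # the "prevalence over time" curves described in the abstract.
        terms = vectorizer.get_feature_names_out()
        for k, topic in enumerate(lda.components_):
            top = [terms[i] for i in topic.argsort()[-5:][::-1]]
            print(f"topic {k}: {' '.join(top)}")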
    Date
    22. 7.2006 17:32:00
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.6, S.753-767
  4. Kantor, P.B.; Voorhees, E.: Information retrieval with scanned texts (2000) 0.01
    0.005647761 = product of:
      0.033886563 = sum of:
        0.033886563 = weight(_text_:information in 3901) [ClassicSimilarity], result of:
          0.033886563 = score(doc=3901,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.43886948 = fieldWeight in 3901, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.125 = fieldNorm(doc=3901)
      0.16666667 = coord(1/6)
    
    Source
    Information retrieval. 2(2000), S.165-176
  5. Hlava, M.M.: Automatic indexing : a matter of degree (2002) 0.00
    0.0034943735 = product of:
      0.020966241 = sum of:
        0.020966241 = weight(_text_:information in 2501) [ClassicSimilarity], result of:
          0.020966241 = score(doc=2501,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.27153665 = fieldWeight in 2501, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.109375 = fieldNorm(doc=2501)
      0.16666667 = coord(1/6)
    
    Source
    Bulletin of the American Society for Information Science. 28(2002) no.1, S.12-15
  6. Mongin, L.; Fu, Y.Y.; Mostafa, J.: Open Archives data Service prototype and automated subject indexing using D-Lib archive content as a testbed (2003) 0.00
    0.00334871 = product of:
      0.02009226 = sum of:
        0.02009226 = weight(_text_:information in 1167) [ClassicSimilarity], result of:
          0.02009226 = score(doc=1167,freq=10.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.2602176 = fieldWeight in 1167, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1167)
      0.16666667 = coord(1/6)
    
    Abstract
     The Indiana University School of Library and Information Science opened a new research laboratory in January 2003: the Indiana University School of Library and Information Science Information Processing Laboratory [IU IP Lab]. The purpose of the new laboratory is to facilitate collaboration between scientists in the department in the areas of information retrieval (IR) and information visualization (IV) research. The lab has several areas of focus, including grid and cluster computing, a standard Java-based software platform to support plug-and-play research datasets, a selection of standard IR modules, and standard IV algorithms. Future development includes software to enable researchers to contribute datasets, IR algorithms, and visualization algorithms into the standard environment. We decided early on to use OAI-PMH as a resource discovery tool because it is consistent with our mission.
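     The OAI-PMH choice mentioned above can be illustrated with a minimal harvesting sketch; the endpoint URL below is a placeholder, and only the standard ListRecords verb and oai_dc metadata prefix are assumed:

        import requests
        import xml.etree.ElementTree as ET

        # Hypothetical endpoint; the actual D-Lib OAI interface may differ.
        BASE_URL = "https://example.org/oai"

        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        resp = requests.get(BASE_URL, params=params, timeout=30)
        resp.raise_for_status()

        ns = {
            "oai": "http://www.openarchives.org/OAI/2.0/",
            "dc": "http://purl.org/dc/elements/1.1/",
        }
        root = ET.fromstring(resp.content)
        for record in root.iterfind(".//oai:record", ns):
            titles = [t.text for t in record.iterfind(".//dc:title", ns)]
            print(titles)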
  7. Li, W.; Wong, K.-F.; Yuan, C.: Toward automatic Chinese temporal information extraction (2001) 0.00
    0.0030569402 = product of:
      0.01834164 = sum of:
        0.01834164 = weight(_text_:information in 6029) [ClassicSimilarity], result of:
          0.01834164 = score(doc=6029,freq=12.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.23754507 = fieldWeight in 6029, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=6029)
      0.16666667 = coord(1/6)
    
    Abstract
     Over the past few years, temporal information processing and temporal database management have increasingly become hot topics. Nevertheless, only a few researchers have investigated these areas in the Chinese language. This lays down the objective of our research: to exploit Chinese language processing techniques for temporal information extraction and concept reasoning. In this article, we first study the mechanism for expressing time in Chinese. On the basis of the study, we then design a general frame structure for maintaining the extracted temporal concepts and propose a system for extracting time-dependent information from Hong Kong financial news. In the system, temporal knowledge is represented by different types of temporal concepts (TTC) and different temporal relations, including absolute and relative relations, which are used to correlate action times with reference times. In analyzing a sentence, the algorithm first determines the situation related to the verb. This in turn will identify the type of temporal concept associated with the verb. After that, the relevant temporal information is extracted and the temporal relations are derived. These relations link relevant concept frames together in chronological order, which in turn provide the knowledge to fulfill users' queries, e.g., for question-answering (i.e., Q&A) applications.
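     As an illustration only (the paper works on Chinese news text with a much richer temporal concept frame), a minimal sketch of separating absolute from relative temporal expressions and resolving the relative ones against a reference time:

        import re
        from datetime import date, timedelta

        REFERENCE_TIME = date(2001, 3, 15)   # e.g. the publication date of a news item

        ABSOLUTE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")    # e.g. 2001-03-12
        RELATIVE = {"yesterday": -1, "today": 0, "tomorrow": 1}  # tiny toy lexicon

        def extract_times(sentence, reference=REFERENCE_TIME):
            """Return (expression, resolved_date, relation) triples."""
            found = []
            for m in ABSOLUTE.finditer(sentence):
                y, mo, d = map(int, m.groups())
                found.append((m.group(0), date(y, mo, d), "absolute"))
            for word, offset in RELATIVE.items():
                if word in sentence.lower():
                    found.append((word, reference + timedelta(days=offset), "relative"))
            return found

        print(extract_times("The bank reported yesterday that trading resumed on 2001-03-12."))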
    Source
    Journal of the American Society for Information Science and technology. 52(2001) no.9, S.748-762
  8. Anderson, J.D.; Pérez-Carballo, J.: ¬The nature of indexing: how humans and machines analyze messages and texts for retrieval : Part I: Research and the nature of human indexing (2001) 0.00
    0.0029951772 = product of:
      0.017971063 = sum of:
        0.017971063 = weight(_text_:information in 3136) [ClassicSimilarity], result of:
          0.017971063 = score(doc=3136,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.23274569 = fieldWeight in 3136, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.09375 = fieldNorm(doc=3136)
      0.16666667 = coord(1/6)
    
    Source
    Information processing and management. 37(2001) no.2, S.231-254
  9. Salton, G.: SMART System: 1961-1976 (2009) 0.00
    0.0028238804 = product of:
      0.016943282 = sum of:
        0.016943282 = weight(_text_:information in 3879) [ClassicSimilarity], result of:
          0.016943282 = score(doc=3879,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.21943474 = fieldWeight in 3879, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0625 = fieldNorm(doc=3879)
      0.16666667 = coord(1/6)
    
    Abstract
    While a number of researchers had experimented during the 1950's on automatic indexing and retrieval in various forms, it was Gerard Salton who brought the information retrieval experimental paradigm to full fruition, with his "SMART" system. His work has been enormously influential.
    Source
    Encyclopedia of library and information sciences. 3rd ed. Ed.: M.J. Bates
  10. Mansour, N.; Haraty, R.A.; Daher, W.; Houri, M.: ¬An auto-indexing method for Arabic text (2008) 0.00
    0.0025938996 = product of:
      0.015563398 = sum of:
        0.015563398 = weight(_text_:information in 2103) [ClassicSimilarity], result of:
          0.015563398 = score(doc=2103,freq=6.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.20156369 = fieldWeight in 2103, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2103)
      0.16666667 = coord(1/6)
    
    Abstract
     This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. This method is mainly based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the container document. The weight is based on how spread out the word is in a document, not only on its rate of occurrence. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples. For these examples, we obtained an average recall of 46% and an average precision of 64%.
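     The paper's exact weighting formula is not reproduced in the abstract; a minimal sketch of the underlying idea (scaling a word's frequency by how widely it is spread over the segments of a document) might look like:

        from collections import Counter

        def spread_weight(tokens, term, n_segments=4):
            """Toy weight: relative frequency scaled by the fraction of
            document segments in which the term occurs (its 'spread')."""
            if not tokens:
                return 0.0
            freq = Counter(tokens)[term] / len(tokens)
            size = max(1, len(tokens) // n_segments)
            segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
            spread = sum(term in seg for seg in segments) / len(segments)
            return freq * spread

        doc = "index words are extracted and index weights reflect how the index words spread".split()
        print(spread_weight(doc, "index"))    # frequent and spread out -> higher weight
        print(spread_weight(doc, "spread"))   # rare and localised -> lower weight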
    Source
    Information processing and management. 44(2008) no.4, S.1538-1545
  11. Chung, Y.M.; Lee, J.Y.: ¬A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.00
    0.002495981 = product of:
      0.014975886 = sum of:
        0.014975886 = weight(_text_:information in 5769) [ClassicSimilarity], result of:
          0.014975886 = score(doc=5769,freq=8.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.19395474 = fieldWeight in 5769, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.16666667 = coord(1/6)
    
    Abstract
     Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as the X² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X² statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
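     A minimal sketch of several of the measures named above, computed from a 2x2 term co-occurrence table (a = documents containing both terms, b and c = only one term, d = neither), using standard textbook definitions that may differ in detail from the paper's variants:

        import math

        def association_measures(a, b, c, d):
            """a, b, c, d: cell counts of a 2x2 term co-occurrence table."""
            n = a + b + c + d
            cosine  = a / math.sqrt((a + b) * (a + c))
            jaccard = a / (a + b + c)
            # Pointwise mutual information of the two terms co-occurring.
            pmi = math.log2((a * n) / ((a + b) * (a + c)))
            # Yule's Y (coefficient of colligation).
            yules_y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))
            # Chi-square statistic for the 2x2 table.
            chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
            return {"cosine": cosine, "jaccard": jaccard, "MI": pmi,
                    "Yule_Y": yules_y, "chi2": chi2}

        print(association_measures(a=30, b=20, c=10, d=940))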
    Source
    Journal of the American Society for Information Science and technology. 52(2001) no.4, S.283-296
  12. Anderson, J.D.; Pérez-Carballo, J.: ¬The nature of indexing: how humans and machines analyze messages and texts for retrieval : Part II: Machine indexing, and the allocation of human versus machine effort (2001) 0.00
    0.002495981 = product of:
      0.014975886 = sum of:
        0.014975886 = weight(_text_:information in 368) [ClassicSimilarity], result of:
          0.014975886 = score(doc=368,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.19395474 = fieldWeight in 368, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.078125 = fieldNorm(doc=368)
      0.16666667 = coord(1/6)
    
    Source
    Information processing and management. 37(2001) no.2, S.255-277
  13. Humphrey, S.M.; Névéol, A.; Browne, A.; Gobeil, J.; Ruch, P.; Darmoni, S.J.: Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty (2009) 0.00
    0.0021615832 = product of:
      0.0129694985 = sum of:
        0.0129694985 = weight(_text_:information in 3300) [ClassicSimilarity], result of:
          0.0129694985 = score(doc=3300,freq=6.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.16796975 = fieldWeight in 3300, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3300)
      0.16666667 = coord(1/6)
    
    Abstract
     Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings (MeSH) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures performance is comparable, and for one measure JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule-based) might be combined and then evaluated to show that they are complementary to one another.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.12, S.2530-2539
  14. Souza, R.R.; Raghavan, K.S.: ¬A methodology for noun phrase-based automatic indexing (2006) 0.00
    0.0021179102 = product of:
      0.012707461 = sum of:
        0.012707461 = weight(_text_:information in 173) [ClassicSimilarity], result of:
          0.012707461 = score(doc=173,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.16457605 = fieldWeight in 173, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=173)
      0.16666667 = coord(1/6)
    
    Abstract
    The scholarly community is increasingly employing the Web both for publication of scholarly output and for locating and accessing relevant scholarly literature. Organization of this vast body of digital information assumes significance in this context. The sheer volume of digital information to be handled makes traditional indexing and knowledge representation strategies ineffective and impractical. It is, therefore, worth exploring new approaches. An approach being discussed considers the intrinsic semantics of texts of documents. Based on the hypothesis that noun phrases in a text are semantically rich in terms of their ability to represent the subject content of the document, this approach seeks to identify and extract noun phrases instead of single keywords, and use them as descriptors. This paper presents a methodology that has been developed for extracting noun phrases from Portuguese texts. The results of an experiment carried out to test the adequacy of the methodology are also presented.
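     A minimal sketch of the idea, on English rather than Portuguese and with a hand-tagged sentence so no tagger model is required: collect maximal runs of adjectives and nouns ending in a noun as candidate descriptors.

        # Each token is (word, part-of-speech tag); a real system would run a POS tagger.
        tagged = [
            ("automatic", "ADJ"), ("indexing", "NOUN"), ("of", "ADP"),
            ("digital", "ADJ"), ("documents", "NOUN"), ("requires", "VERB"),
            ("robust", "ADJ"), ("noun", "NOUN"), ("phrase", "NOUN"), ("extraction", "NOUN"),
        ]

        def noun_phrases(tokens, np_tags=("ADJ", "NOUN")):
            """Collect maximal runs of ADJ/NOUN tokens that end in a NOUN."""
            phrases, current = [], []
            for word, tag in tokens + [("", "END")]:
                if tag in np_tags:
                    current.append((word, tag))
                else:
                    while current and current[-1][1] != "NOUN":
                        current.pop()                      # drop trailing adjectives
                    if current:
                        phrases.append(" ".join(w for w, _ in current))
                    current = []
            return phrases

        print(noun_phrases(tagged))
        # ['automatic indexing', 'digital documents', 'robust noun phrase extraction']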
  15. Medelyan, O.; Witten, I.H.: Domain-independent automatic keyphrase indexing with small training sets (2008) 0.00
    0.0021179102 = product of:
      0.012707461 = sum of:
        0.012707461 = weight(_text_:information in 1871) [ClassicSimilarity], result of:
          0.012707461 = score(doc=1871,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.16457605 = fieldWeight in 1871, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1871)
      0.16666667 = coord(1/6)
    
    Abstract
    Keyphrases are widely used in both physical and digital libraries as a brief, but precise, summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloging rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and of French and Spanish documents.
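     A minimal sketch of the core step, assuming a toy thesaurus in place of the full domain vocabulary and simple frequency ranking in place of the trained model:

        import re
        from collections import Counter

        # Toy domain thesaurus: surface form -> preferred descriptor.
        THESAURUS = {
            "soil erosion": "Soil erosion",
            "crop rotation": "Crop rotation",
            "irrigation": "Irrigation",
        }

        def keyphrases(text, thesaurus=THESAURUS, top_n=2):
            """Rank word 1- and 2-grams that match thesaurus entries by frequency."""
            words = re.findall(r"[a-z]+", text.lower())
            candidates = words + [" ".join(p) for p in zip(words, words[1:])]
            hits = Counter(c for c in candidates if c in thesaurus)
            return [thesaurus[c] for c, _ in hits.most_common(top_n)]

        doc = ("Crop rotation and irrigation practices reduce soil erosion; "
               "soil erosion remains a major problem in this region.")
        print(keyphrases(doc))   # ['Soil erosion', ...]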
    Source
    Journal of the American Society for Information Science and Technology. 59(2008) no.7, S.1026-1040
  16. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.00
    0.0021179102 = product of:
      0.012707461 = sum of:
        0.012707461 = weight(_text_:information in 2910) [ClassicSimilarity], result of:
          0.012707461 = score(doc=2910,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.16457605 = fieldWeight in 2910, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2910)
      0.16666667 = coord(1/6)
    
    Abstract
    Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
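     The acquisition procedure itself is beyond a few lines, but the normalisation step it feeds (looking inflected forms up in a lexicon and conflating them to one representative form) can be sketched as follows, with English toy forms standing in for Croatian:

        # Toy inflectional lexicon: surface form -> representative (normalised) form.
        # In the paper such a lexicon is acquired automatically from raw corpora.
        LEXICON = {
            "indexes": "index", "indexed": "index", "indexing": "index",
            "retrieves": "retrieve", "retrieved": "retrieve", "retrieving": "retrieve",
        }

        def normalise(tokens, lexicon=LEXICON):
            """Conflate tokens the lexicon knows to their representative form;
            unknown tokens are left as they are (in between stemming and lemmatisation)."""
            return [lexicon.get(t.lower(), t.lower()) for t in tokens]

        print(normalise("Retrieved documents were indexed quickly".split()))
        # ['retrieve', 'documents', 'were', 'index', 'quickly']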
    Source
    Information processing and management. 44(2008) no.5, S.1720-1731
  17. Dolamic, L.; Savoy, J.: Indexing and searching strategies for the Russian language (2009) 0.00
    0.0017649251 = product of:
      0.01058955 = sum of:
        0.01058955 = weight(_text_:information in 3301) [ClassicSimilarity], result of:
          0.01058955 = score(doc=3301,freq=4.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.13714671 = fieldWeight in 3301, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3301)
      0.16666667 = coord(1/6)
    
    Abstract
    This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information retrieval (IR) models, including the Okapi, the Divergence from Randomness (DFR), a statistical language model (LM), as well as two vector-space approaches, namely, the classical tf idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the MAP by more than 50%, and these differences are always significant. When applying an n-gram approach, performance differences are usually lower than an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.12, S.2540-2547
  18. Pulgarin, A.; Gil-Leiva, I.: Bibliometric analysis of the automatic indexing literature : 1956-2000 (2004) 0.00
    0.0017471868 = product of:
      0.010483121 = sum of:
        0.010483121 = weight(_text_:information in 2566) [ClassicSimilarity], result of:
          0.010483121 = score(doc=2566,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.13576832 = fieldWeight in 2566, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2566)
      0.16666667 = coord(1/6)
    
    Source
    Information processing and management. 40(2004) no.2, S.365-377
  19. Dolamic, L.; Savoy, J.: When stopword lists make the difference (2009) 0.00
    0.0017471868 = product of:
      0.010483121 = sum of:
        0.010483121 = weight(_text_:information in 3319) [ClassicSimilarity], result of:
          0.010483121 = score(doc=3319,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.13576832 = fieldWeight in 3319, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3319)
      0.16666667 = coord(1/6)
    
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.1, S.200-203
  20. Roberts, D.; Souter, C.: ¬The automation of controlled vocabulary subject indexing of medical journal articles (2000) 0.00
    0.0014975886 = product of:
      0.0089855315 = sum of:
        0.0089855315 = weight(_text_:information in 711) [ClassicSimilarity], result of:
          0.0089855315 = score(doc=711,freq=2.0), product of:
            0.0772133 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.043984205 = queryNorm
            0.116372846 = fieldWeight in 711, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=711)
      0.16666667 = coord(1/6)
    
    Abstract
    This article discusses the possibility of the automation of sophisticated subject indexing of medical journal articles. Approaches to subject descriptor assignment in information retrieval research are usually either based upon the manual descriptors in the database or generation of search parameters from the text of the article. The principles of the Medline indexing system are described, followed by a summary of a pilot project, based upon the Amed database. The results suggest that a more extended study, based upon Medline, should encompass various components: Extraction of 'concept strings' from titles and abstracts of records, based upon linguistic features characteristic of medical literature. Use of the Unified Medical Language System (UMLS) for identification of controlled vocabulary descriptors. Coordination of descriptors, utilising features of the Medline indexing system. The emphasis should be on system manipulation of data, based upon input, available resources and specifically designed rules.