Search (152 results, page 2 of 8)

  • × theme_ss:"Automatisches Klassifizieren"
  1. Malenica, M.; Smuc, T.; Snajder, J.; Basic, B.D.: Language morphology offset : text classification on a Croatian-English parallel corpus (2008) 0.01
    0.007307842 = product of:
      0.029231368 = sum of:
        0.029231368 = weight(_text_:for in 2035) [ClassicSimilarity], result of:
          0.029231368 = score(doc=2035,freq=14.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.32930255 = fieldWeight in 2035, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.046875 = fieldNorm(doc=2035)
      0.25 = coord(1/4)
    
    Abstract
    We investigate how, and to what extent, morphological complexity of the language influences text classification using support vector machines (SVM). The Croatian-English parallel corpus provides the basis for direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance based on a series of parallel experiments on both languages, carried over a large scale of different feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments have shown that the improvements in SVM classifier performance is statistically significant; they are greater for small and medium number of features, especially for Croatian, whereas for large number of features the improvements are rather small and may be negligible in practice for both languages.
  2. Rose, J.R.; Gasteiger, J.: HORACE: an automatic system for the hierarchical classification of chemical reactions (1994) 0.01
    0.0072056293 = product of:
      0.028822517 = sum of:
        0.028822517 = weight(_text_:for in 7696) [ClassicSimilarity], result of:
          0.028822517 = score(doc=7696,freq=10.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.3246967 = fieldWeight in 7696, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0546875 = fieldNorm(doc=7696)
      0.25 = coord(1/4)
    
    Abstract
    Describes an automatic classification system for classifying chemical reactions. A detailed study of the classification of chemical reactions, based on topological and physicochemical features, is followed by an analysis of the hierarchical classification produced by the HORACE algorithm (Hierarchical Organization of Reactions through Attribute and Condition Eduction), which combines both approaches in a synergistic manner. The searching and updating of reaction hierarchies is demonstrated with the hierarchies produced for 2 data sets by the HORACE algorithm. Shows that reaction hierarchies provide an efficient access to reaction information and indicate the main reaction types for a given reaction scheme, define the scope of a reaction type, enable searchers to find unusual reactions, and can help in locating the reactions most relevant for a given problem
  3. Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.01
    0.0069052614 = product of:
      0.027621046 = sum of:
        0.027621046 = weight(_text_:for in 3614) [ClassicSimilarity], result of:
          0.027621046 = score(doc=3614,freq=18.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.31116164 = fieldWeight in 3614, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
      0.25 = coord(1/4)
    
    Abstract
    Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification systems for browsing. The classification algorithm was evaluated by the users who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing showed to be correlated and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); when searching for class captions, returning the hierarchical tree expanded around the class in which caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
  4. HaCohen-Kerner, Y.; Beck, H.; Yehudai, E.; Rosenstein, M.; Mughaz, D.: Cuisine : classification using stylistic feature sets and/or name-based feature sets (2010) 0.01
    0.0069052614 = product of:
      0.027621046 = sum of:
        0.027621046 = weight(_text_:for in 3706) [ClassicSimilarity], result of:
          0.027621046 = score(doc=3706,freq=18.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.31116164 = fieldWeight in 3706, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3706)
      0.25 = coord(1/4)
    
    Abstract
    Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification UsIng Stylistic feature sets and/or NamE-based feature sets) achieves accuracy results between 90.71 to 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time&place). For the first six tasks, the stylistic feature sets in general and the quantitative feature set in particular are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds in significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.8, S.1644-1657
  5. Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.01
    0.0067657465 = product of:
      0.027062986 = sum of:
        0.027062986 = weight(_text_:for in 6010) [ClassicSimilarity], result of:
          0.027062986 = score(doc=6010,freq=12.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.3048749 = fieldWeight in 6010, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.046875 = fieldNorm(doc=6010)
      0.25 = coord(1/4)
    
    Abstract
    Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer-genre classifiers should be reusable across multiple topics-which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.11, S.1506-1518
  6. Golub, K.: Automated subject classification of textual documents in the context of Web-based hierarchical browsing (2011) 0.01
    0.0067657465 = product of:
      0.027062986 = sum of:
        0.027062986 = weight(_text_:for in 4558) [ClassicSimilarity], result of:
          0.027062986 = score(doc=4558,freq=12.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.3048749 = fieldWeight in 4558, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.046875 = fieldNorm(doc=4558)
      0.25 = coord(1/4)
    
    Abstract
    While automated methods for information organization have been around for several decades now, exponential growth of the World Wide Web has put them into the forefront of research in different communities, within which several approaches can be identified: 1) machine learning (algorithms that allow computers to improve their performance based on learning from pre-existing data); 2) document clustering (algorithms for unsupervised document organization and automated topic extraction); and 3) string matching (algorithms that match given strings within larger text). Here the aim was to automatically organize textual documents into hierarchical structures for subject browsing. The string-matching approach was tested using a controlled vocabulary (containing pre-selected and pre-defined authorized terms, each corresponding to only one concept). The results imply that an appropriate controlled vocabulary, with a sufficient number of entry terms designating classes, could in itself be a solution for automated classification. Then, if the same controlled vocabulary had an appropriat hierarchical structure, it would at the same time provide a good browsing structure for the collection of automatically classified documents.
  7. Wu, M.; Liu, Y.-H.; Brownlee, R.; Zhang, X.: Evaluating utility and automatic classification of subject metadata from Research Data Australia (2021) 0.01
    0.0067657465 = product of:
      0.027062986 = sum of:
        0.027062986 = weight(_text_:for in 453) [ClassicSimilarity], result of:
          0.027062986 = score(doc=453,freq=12.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.3048749 = fieldWeight in 453, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.046875 = fieldNorm(doc=453)
      0.25 = coord(1/4)
    
    Abstract
    In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well represented categories of subject metadata, and these categories tend to have features that can distinguish themselves from the other categories. Our findings raise implications for data catalogue providers; they should invest more effort to enhance the quality of data records by providing an adequate description of these records for under-represented subject categories.
  8. Shafer, K.E.: Evaluating Scorpion results (1998) 0.01
    0.006510343 = product of:
      0.026041372 = sum of:
        0.026041372 = weight(_text_:for in 1569) [ClassicSimilarity], result of:
          0.026041372 = score(doc=1569,freq=4.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.29336601 = fieldWeight in 1569, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.078125 = fieldNorm(doc=1569)
      0.25 = coord(1/4)
    
    Abstract
    Scorpion is a research project at OCLC that builds tools for automatic subject assignment by combining library science and information retrieval techniques. A thesis of Scorpion is that the Dewey Decimal Classification (Dewey) can be used to perform automatic subject assignment for electronic items.
  9. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.01
    0.006510343 = product of:
      0.026041372 = sum of:
        0.026041372 = weight(_text_:for in 4101) [ClassicSimilarity], result of:
          0.026041372 = score(doc=4101,freq=16.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.29336601 = fieldWeight in 4101, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4101)
      0.25 = coord(1/4)
    
    Abstract
    Hierarchical text classification (HTC) approaches have recently attracted a lot of interest on the part of researchers in human language technology and machine learning, since they have been shown to bring about equal, if not better, classification accuracy with respect to their "flat" counterparts while allowing exponential time savings at both learning and classification time. A typical component of HTC methods is a "local" policy for selecting negative examples: Given a category c, its negative training examples are by default identified with the training examples that are negative for c and positive for the categories which are siblings of c in the hierarchy. However, this policy has always been taken for granted and never been subjected to careful scrutiny since first proposed 15 years ago. This article proposes a thorough experimental comparison between this policy and three other policies for the selection of negative examples in HTC contexts, one of which (BEST LOCAL (k)) is being proposed for the first time in this article. We compare these policies on the hierarchical versions of three supervised learning algorithms (boosting, support vector machines, and naïve Bayes) by performing experiments on two standard TC datasets, REUTERS-21578 and RCV1-V2.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.11, S.2256-2265
  10. Kwok, K.L.: ¬The use of titles and cited titles as document representations for automatic classification (1975) 0.01
    0.0064449105 = product of:
      0.025779642 = sum of:
        0.025779642 = weight(_text_:for in 4347) [ClassicSimilarity], result of:
          0.025779642 = score(doc=4347,freq=2.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.29041752 = fieldWeight in 4347, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.109375 = fieldNorm(doc=4347)
      0.25 = coord(1/4)
    
  11. Koch, T.; Vizine-Goetz, D.: Automatic classification and content navigation support for Web services : DESIRE II cooperates with OCLC (1998) 0.01
    0.0064449105 = product of:
      0.025779642 = sum of:
        0.025779642 = weight(_text_:for in 1568) [ClassicSimilarity], result of:
          0.025779642 = score(doc=1568,freq=8.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.29041752 = fieldWeight in 1568, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1568)
      0.25 = coord(1/4)
    
    Abstract
    Emerging standards in knowledge representation and organization are preparing the way for distributed vocabulary support in Internet search services. NetLab researchers are exploring several innovative solutions for searching and browsing in the subject-based Internet gateway, Electronic Engineering Library, Sweden (EELS). The implementation of the EELS service is described, specifically, the generation of the robot-gathered database 'All' engineering and the automated application of the Ei thesaurus and classification scheme. NetLab and OCLC researchers are collaborating to investigate advanced solutions to automated classification in the DESIRE II context. A plan for furthering the development of distributed vocabulary support in Internet search services is offered.
  12. Fong, A.C.M.: Mining a Web citation database for document clustering (2002) 0.01
    0.0064449105 = product of:
      0.025779642 = sum of:
        0.025779642 = weight(_text_:for in 3940) [ClassicSimilarity], result of:
          0.025779642 = score(doc=3940,freq=2.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.29041752 = fieldWeight in 3940, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.109375 = fieldNorm(doc=3940)
      0.25 = coord(1/4)
    
  13. Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.01
    0.0064449105 = product of:
      0.025779642 = sum of:
        0.025779642 = weight(_text_:for in 724) [ClassicSimilarity], result of:
          0.025779642 = score(doc=724,freq=8.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.29041752 = fieldWeight in 724, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0546875 = fieldNorm(doc=724)
      0.25 = coord(1/4)
    
    Abstract
    The Wikidata gadget, CCLitBox, for the automated classification of literary authors and works by a faceted classification and using Linked Open Data (LOD) is presented. The tool reproduces the classification algorithm of class O Literature of the Colon Classification and uses data freely available in Wikidata to create Colon Classification class numbers. CCLitBox is totally free and enables any user to classify literary authors and their works; it is easily accessible to everybody; it uses LOD from Wikidata but missing data for classification can be freely added if necessary; it is readymade for any cooperative and networked project.
    Footnote
    Teil eines Themenheftes: Artificial intelligence (AI) and automated processes for subject sccess
  14. Losee, R.M.; Haas, S.W.: Sublanguage terms : dictionaries, usage, and automatic classification (1995) 0.01
    0.0063788076 = product of:
      0.02551523 = sum of:
        0.02551523 = weight(_text_:for in 2650) [ClassicSimilarity], result of:
          0.02551523 = score(doc=2650,freq=6.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.28743884 = fieldWeight in 2650, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0625 = fieldNorm(doc=2650)
      0.25 = coord(1/4)
    
    Abstract
    The use of terms from natural and social science titles and abstracts is studied from the perspective of sublanguages and their specialized dictionaries. Explores different notions of sublanguage distinctiveness. Object methods for separating hard and soft sciences are suggested based on measures of sublanguage use, dictionary characteristics, and sublanguage distinctiveness. Abstracts were automatically classified with a high degree of accuracy by using a formula that condsiders the degree of uniqueness of terms in each sublanguage. This may prove useful for text filtering of information retrieval systems
    Source
    Journal of the American Society for Information Science. 46(1995) no.7, S.519-529
  15. Godby, C. J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.01
    0.0063788076 = product of:
      0.02551523 = sum of:
        0.02551523 = weight(_text_:for in 1567) [ClassicSimilarity], result of:
          0.02551523 = score(doc=1567,freq=6.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.28743884 = fieldWeight in 1567, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0625 = fieldNorm(doc=1567)
      0.25 = coord(1/4)
    
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic
  16. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: ¬A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 0.01
    0.0061762533 = product of:
      0.024705013 = sum of:
        0.024705013 = weight(_text_:for in 1998) [ClassicSimilarity], result of:
          0.024705013 = score(doc=1998,freq=10.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.27831143 = fieldWeight in 1998, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.046875 = fieldNorm(doc=1998)
      0.25 = coord(1/4)
    
    Abstract
    Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabularly-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages was 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did.
    Source
    Journal of the American Society for Information Science and Technology. 59(2008) no.9, S.1409-1419
  17. Gauch, S.; Chandramouli, A.; Ranganathan, S.: Training a hierarchical classifier using inter document relationships (2009) 0.01
    0.0061762533 = product of:
      0.024705013 = sum of:
        0.024705013 = weight(_text_:for in 2697) [ClassicSimilarity], result of:
          0.024705013 = score(doc=2697,freq=10.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.27831143 = fieldWeight in 2697, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.046875 = fieldNorm(doc=2697)
      0.25 = coord(1/4)
    
    Abstract
    Text classifiers automatically classify documents into appropriate concepts for different applications. Most classification approaches use flat classifiers that treat each concept as independent, even when the concept space is hierarchically structured. In contrast, hierarchical text classification exploits the structural relationships between the concepts. In this article, we explore the effectiveness of hierarchical classification for a large concept hierarchy. Since the quality of the classification is dependent on the quality and quantity of the training data, we evaluate the use of documents selected from subconcepts to address the sparseness of training data for the top-level classifiers and the use of document relationships to identify the most representative training documents. By selecting training documents using structural and similarity relationships, we achieve a statistically significant improvement of 39.8% (from 54.5-76.2%) in the accuracy of the hierarchical classifier over that of the flat classifier for a large, three-level concept hierarchy.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.1, S.47-58
  18. Ribeiro-Neto, B.; Laender, A.H.F.; Lima, L.R.S. de: ¬An experimental study in automatically categorizing medical documents (2001) 0.01
    0.006089868 = product of:
      0.024359472 = sum of:
        0.024359472 = weight(_text_:for in 5702) [ClassicSimilarity], result of:
          0.024359472 = score(doc=5702,freq=14.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.27441877 = fieldWeight in 5702, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5702)
      0.25 = coord(1/4)
    
    Abstract
    In this article, we evaluate the retrieval performance of an algorithm that automatically categorizes medical documents. The categorization, which consists in assigning an International Code of Disease (ICD) to the medical document under examination, is based on wellknown information retrieval techniques. The algorithm, which we proposed, operates in a fully automatic mode and requires no supervision or training data. Using a database of 20,569 documents, we verify that the algorithm attains levels of average precision in the 70-80% range for category coding and in the 60-70% range for subcategory coding. We also carefully analyze the case of those documents whose categorization is not in accordance with the one provided by the human specialists. The vast majority of them represent cases that can only be fully categorized with the assistance of a human subject (because, for instance, they require specific knowledge of a given pathology). For a slim fraction of all documents (0.77% for category coding and 1.4% for subcategory coding), the algorithm makes assignments that are clearly incorrect. However, this fraction corresponds to only one-fourth of the mistakes made by the human specialists
    Source
    Journal of the American Society for Information Science and technology. 52(2001) no.5, S.391-401
  19. Humphrey, S.M.; Névéol, A.; Browne, A.; Gobeil, J.; Ruch, P.; Darmoni, S.J.: Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty (2009) 0.01
    0.005638122 = product of:
      0.022552488 = sum of:
        0.022552488 = weight(_text_:for in 3300) [ClassicSimilarity], result of:
          0.022552488 = score(doc=3300,freq=12.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.2540624 = fieldWeight in 3300, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3300)
      0.25 = coord(1/4)
    
    Abstract
    Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including, Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings (MeSH) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures performance is comparable, and for one measure JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule-based) might be combined and then evaluated showing they are complementary to one another.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.12, S.2530-2539
  20. Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.01
    0.005638122 = product of:
      0.022552488 = sum of:
        0.022552488 = weight(_text_:for in 5041) [ClassicSimilarity], result of:
          0.022552488 = score(doc=5041,freq=12.0), product of:
            0.08876751 = queryWeight, product of:
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.047278564 = queryNorm
            0.2540624 = fieldWeight in 5041, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.8775425 = idf(docFreq=18385, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5041)
      0.25 = coord(1/4)
    
    Abstract
    Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend; social community question answering websites are the second choice for students who try to get answers from other peers online. We attempt discovering possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.

Years

Languages

  • e 145
  • d 5
  • a 1
  • chi 1
  • More… Less…

Types

  • a 133
  • el 22
  • s 2
  • m 1
  • r 1
  • x 1
  • More… Less…