Search (85 results, page 2 of 5)

  • × theme_ss:"Automatisches Klassifizieren"
  • × year_i:[2000 TO 2010}
  1. Golub, K.: Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations (2006) 0.01
    0.0059712525 = product of:
      0.041798767 = sum of:
        0.03469929 = weight(_text_:web in 5897) [ClassicSimilarity], result of:
          0.03469929 = score(doc=5897,freq=8.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.43268442 = fieldWeight in 5897, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
        0.007099477 = weight(_text_:information in 5897) [ClassicSimilarity], result of:
          0.007099477 = score(doc=5897,freq=4.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.16457605 = fieldWeight in 5897, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
      0.14285715 = coord(2/14)
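    The indented breakdown above (and in each entry below) is Lucene "explain" output for the ClassicSimilarity scoring model: each matching term contributes fieldWeight (tf · idf · fieldNorm) times queryWeight (idf · queryNorm), and the sum over terms is scaled by the coordination factor. A minimal sketch, assuming the standard Lucene formulas, that recomputes this result's score from the values shown:

    ```python
    from math import sqrt

    def classic_similarity(terms, coord_matched, coord_total):
        """Recompute a Lucene ClassicSimilarity score from explain values.

        terms: one (freq, idf, queryNorm, fieldNorm) tuple per matched term.
        """
        total = 0.0
        for freq, idf, query_norm, field_norm in terms:
            field_weight = sqrt(freq) * idf * field_norm  # tf(freq) * idf * fieldNorm
            query_weight = idf * query_norm               # idf * queryNorm
            total += field_weight * query_weight
        return total * coord_matched / coord_total        # coord(matched/total)

    # Values from result 1 (doc 5897): _text_:web and _text_:information
    print(classic_similarity(
        [(8.0, 3.2635105, 0.024573348, 0.046875),
         (4.0, 1.7554779, 0.024573348, 0.046875)],
        coord_matched=2, coord_total=14))  # ~0.0059712525, as shown above
    ```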
    
    Abstract
    The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
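    The matching approach described in this abstract can be sketched as a simple term-list lookup; the miniature term list below is invented for illustration, and the real system draws its terms from the Ei thesaurus and classification scheme with additional weighting:

    ```python
    import re

    # Hypothetical excerpt of an Ei-style term list: matching term -> class
    term_list = {
        "heat exchanger": "641.2 Heat transfer",
        "finite element": "921.6 Numerical methods",
        "corrosion": "539.1 Corrosion",
    }

    def classify(page_text):
        """Score candidate classes by counting term-list matches in the text."""
        text = page_text.lower()
        scores = {}
        for term, cls in term_list.items():
            hits = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
            if hits:
                scores[cls] = scores.get(cls, 0) + hits
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(classify("Corrosion in heat exchanger tubing: heat exchanger design notes"))
    ```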
  2. Schek, M.: Automatische Klassifizierung in Erschließung und Recherche eines Pressearchivs (2006) 0.01
    0.0059066126 = product of:
      0.041346285 = sum of:
        0.031409275 = weight(_text_:indexierung in 6043) [ClassicSimilarity], result of:
          0.031409275 = score(doc=6043,freq=2.0), product of:
            0.13215348 = queryWeight, product of:
              5.377919 = idf(docFreq=554, maxDocs=44218)
              0.024573348 = queryNorm
            0.23767269 = fieldWeight in 6043, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.377919 = idf(docFreq=554, maxDocs=44218)
              0.03125 = fieldNorm(doc=6043)
        0.00993701 = weight(_text_:retrieval in 6043) [ClassicSimilarity], result of:
          0.00993701 = score(doc=6043,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.13368362 = fieldWeight in 6043, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.03125 = fieldNorm(doc=6043)
      0.14285715 = coord(2/14)
    
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
  3. Yu, W.; Gong, Y.: Document clustering by concept factorization (2004) 0.01
    0.005693029 = product of:
      0.039851204 = sum of:
        0.010040177 = weight(_text_:information in 4084) [ClassicSimilarity], result of:
          0.010040177 = score(doc=4084,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.23274569 = fieldWeight in 4084, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.09375 = fieldNorm(doc=4084)
        0.029811028 = weight(_text_:retrieval in 4084) [ClassicSimilarity], result of:
          0.029811028 = score(doc=4084,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.40105087 = fieldWeight in 4084, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.09375 = fieldNorm(doc=4084)
      0.14285715 = coord(2/14)
    
    Source
SIGIR'04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Ed.: K. Järvelin et al.
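    The cited paper introduces concept factorization, which approximates the non-negative term-document matrix X by X·W·Hᵀ, so that each concept is a linear combination of the documents themselves. The entry gives no algorithmic detail; a rough numpy sketch, assuming the standard multiplicative updates for this formulation:

    ```python
    import numpy as np

    def concept_factorization(X, k, iters=200, eps=1e-9):
        """Cluster the columns (documents) of X via X ~ X @ W @ H.T, W, H >= 0."""
        n = X.shape[1]
        K = X.T @ X                          # doc-doc kernel; X itself not needed again
        rng = np.random.default_rng(0)
        W, H = rng.random((n, k)), rng.random((n, k))
        for _ in range(iters):
            W *= (K @ H) / (K @ W @ (H.T @ H) + eps)
            H *= (K @ W) / (H @ (W.T @ K @ W) + eps)
        return H.argmax(axis=1)              # cluster label per document

    X = np.random.default_rng(1).random((500, 12))   # stand-in: 500 terms x 12 docs
    print(concept_factorization(X, k=3))
    ```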
  4. Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.01
    0.0056741973 = product of:
      0.03971938 = sum of:
        0.03469929 = weight(_text_:web in 1566) [ClassicSimilarity], result of:
          0.03469929 = score(doc=1566,freq=8.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.43268442 = fieldWeight in 1566, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
        0.0050200885 = weight(_text_:information in 1566) [ClassicSimilarity], result of:
          0.0050200885 = score(doc=1566,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.116372846 = fieldWeight in 1566, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
      0.14285715 = coord(2/14)
    
    Abstract
This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77% and 60%, respectively. The third experiment, employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting, achieved a precision ratio of 96%. This implies that it is possible to enhance classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.
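    The kNN stage of the third experiment can be sketched as a generic cosine-similarity k-nearest-neighbours vote; the feature representation, k, and the random stand-in data below are assumptions, not the paper's configuration:

    ```python
    import numpy as np

    def knn_classify(doc_vec, train_vecs, train_labels, k=5):
        """Assign the majority class among the k most cosine-similar training docs."""
        train_labels = np.asarray(train_labels)
        sims = train_vecs @ doc_vec / (
            np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
        top = np.argsort(-sims)[:k]
        labels, counts = np.unique(train_labels[top], return_counts=True)
        return labels[counts.argmax()]

    rng = np.random.default_rng(0)
    train = rng.random((100, 30))                    # stand-in document vectors
    labels = rng.choice(["330 Economics", "380 Commerce"], 100)
    print(knn_classify(rng.random(30), train, labels))
    ```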
    Source
    Journal of information science. 29(2003) no.2, S.117-126
  5. Guerrero-Bote, V.P.; Moya Anegón, F. de; Herrero Solana, V.: Document organization using Kohonen's algorithm (2002) 0.01
    0.005671358 = product of:
      0.039699506 = sum of:
        0.011593399 = weight(_text_:information in 2564) [ClassicSimilarity], result of:
          0.011593399 = score(doc=2564,freq=6.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.2687516 = fieldWeight in 2564, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0625 = fieldNorm(doc=2564)
        0.028106106 = weight(_text_:retrieval in 2564) [ClassicSimilarity], result of:
          0.028106106 = score(doc=2564,freq=4.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.37811437 = fieldWeight in 2564, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0625 = fieldNorm(doc=2564)
      0.14285715 = coord(2/14)
    
    Abstract
The classification of documents from a bibliographic database is a task that is linked to processes of information retrieval based on partial matching. A method is described for vectorizing reference documents from LISA which permits their topological organization using Kohonen's algorithm. As an example, a map is generated of 202 documents from LISA, and an analysis is made of the possibilities of this type of neural network with respect to the development of information retrieval systems based on graphical browsing.
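    A minimal sketch of Kohonen's algorithm (a self-organizing map) applied to document vectors; the grid size, schedules, and random stand-in data are assumptions, not values from the paper:

    ```python
    import numpy as np

    def train_som(docs, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0):
        """Fit a Kohonen map to document vectors; returns the unit-weight grid."""
        rng = np.random.default_rng(0)
        h, w = grid
        W = rng.random((h, w, docs.shape[1]))
        coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), -1)
        for t in range(iters):
            x = docs[rng.integers(len(docs))]
            # best-matching unit for this document
            bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)), (h, w))
            lr = lr0 * (1 - t / iters)                   # decaying learning rate
            sigma = sigma0 * (1 - t / iters) + 0.5       # shrinking neighbourhood
            dist2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            nbh = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            W += lr * nbh * (x - W)      # pull the neighbourhood toward the doc
        return W

    docs = np.random.default_rng(1).random((202, 50))    # stand-in for 202 LISA docs
    som = train_som(docs)                # browse by mapping docs to nearest units
    ```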
    Source
    Information processing and management. 38(2002) no.1, S.79-89
  6. Schek, M.: Automatische Klassifizierung und Visualisierung im Archiv der Süddeutschen Zeitung (2005) 0.01
    0.005168285 = product of:
      0.036177997 = sum of:
        0.027483113 = weight(_text_:indexierung in 4884) [ClassicSimilarity], result of:
          0.027483113 = score(doc=4884,freq=2.0), product of:
            0.13215348 = queryWeight, product of:
              5.377919 = idf(docFreq=554, maxDocs=44218)
              0.024573348 = queryNorm
            0.2079636 = fieldWeight in 4884, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.377919 = idf(docFreq=554, maxDocs=44218)
              0.02734375 = fieldNorm(doc=4884)
        0.008694883 = weight(_text_:retrieval in 4884) [ClassicSimilarity], result of:
          0.008694883 = score(doc=4884,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.11697317 = fieldWeight in 4884, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.02734375 = fieldNorm(doc=4884)
      0.14285715 = coord(2/14)
    
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
  7. Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.01
    0.0050100805 = product of:
      0.03507056 = sum of:
        0.030050473 = weight(_text_:web in 3383) [ClassicSimilarity], result of:
          0.030050473 = score(doc=3383,freq=6.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.37471575 = fieldWeight in 3383, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.046875 = fieldNorm(doc=3383)
        0.0050200885 = weight(_text_:information in 3383) [ClassicSimilarity], result of:
          0.0050200885 = score(doc=3383,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.116372846 = fieldWeight in 3383, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=3383)
      0.14285715 = coord(2/14)
    
    Abstract
In recent years, we have witnessed continual growth in the use of ontologies to provide a mechanism for machine reasoning. This paper describes an automatic classifier which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion and mapped into DDC and LCC. Secondly, we propose a formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in improved classification accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.
    Content
Paper presented at: The Third International Conference on Web Information Systems Engineering (WISE'02), Dec. 12-14, 2002, Singapore, S.182.
  8. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.01
    0.0050100805 = product of:
      0.03507056 = sum of:
        0.030050473 = weight(_text_:web in 87) [ClassicSimilarity], result of:
          0.030050473 = score(doc=87,freq=6.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.37471575 = fieldWeight in 87, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.046875 = fieldNorm(doc=87)
        0.0050200885 = weight(_text_:information in 87) [ClassicSimilarity], result of:
          0.0050200885 = score(doc=87,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.116372846 = fieldWeight in 87, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=87)
      0.14285715 = coord(2/14)
    
    Abstract
Most text classification techniques assume that manually labeled documents (corpora) can be easily obtained while learning text classifiers. However, labeled training documents are sometimes unavailable, or inadequate even when they are available. The goal of this article is to present a self-learned approach to extracting high-quality training documents from the Web when the required manually labeled documents are unavailable or of poor quality. To learn a text classifier automatically, we need only a set of user-defined categories and some highly related keywords. Extensive experiments are conducted to evaluate the performance of the proposed approach using the test set from the Reuters-21578 news data set. The experiments show that very promising results can be achieved using only automatically extracted documents from the Web.
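    The self-learned approach can be sketched as querying the Web with each category's keywords and treating the returned pages as pseudo-labeled training data. `web_search` below is a hypothetical placeholder for any search API, and the paper's quality-filtering steps are omitted:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def web_search(query, n=50):
        """Hypothetical placeholder: return the texts of the top-n result pages."""
        raise NotImplementedError

    def build_classifier(categories):
        """categories: {category name: list of highly related keywords}."""
        texts, labels = [], []
        for name, keywords in categories.items():
            for page in web_search(" ".join(keywords)):  # pseudo-labeled training docs
                texts.append(page)
                labels.append(name)
        vectorizer = TfidfVectorizer()
        model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
        return vectorizer, model
    ```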
    Source
    Journal of the American Society for Information Science and Technology. 58(2007) no.1, S.88-96
  9. Pfeffer, M.: Automatische Vergabe von RVK-Notationen mittels fallbasiertem Schließen (2009) 0.00
    0.0049712714 = product of:
      0.034798898 = sum of:
        0.0281402 = weight(_text_:frankfurt in 3051) [ClassicSimilarity], result of:
          0.0281402 = score(doc=3051,freq=2.0), product of:
            0.10213336 = queryWeight, product of:
              4.1562657 = idf(docFreq=1882, maxDocs=44218)
              0.024573348 = queryNorm
            0.27552408 = fieldWeight in 3051, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1562657 = idf(docFreq=1882, maxDocs=44218)
              0.046875 = fieldNorm(doc=3051)
        0.006658699 = product of:
          0.019976096 = sum of:
            0.019976096 = weight(_text_:22 in 3051) [ClassicSimilarity], result of:
              0.019976096 = score(doc=3051,freq=2.0), product of:
                0.08605168 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.024573348 = queryNorm
                0.23214069 = fieldWeight in 3051, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3051)
          0.33333334 = coord(1/3)
      0.14285715 = coord(2/14)
    
    Date
22.8.2009 19:51:28
    Imprint
Frankfurt, M.: Klostermann
10. Cui, H.; Heidorn, P.B.; Zhang, H.: An approach to automatic classification of text for information retrieval (2002) 0.00
    0.00469651 = product of:
      0.032875568 = sum of:
        0.008282723 = weight(_text_:information in 174) [ClassicSimilarity], result of:
          0.008282723 = score(doc=174,freq=4.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.1920054 = fieldWeight in 174, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=174)
        0.024592843 = weight(_text_:retrieval in 174) [ClassicSimilarity], result of:
          0.024592843 = score(doc=174,freq=4.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.33085006 = fieldWeight in 174, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0546875 = fieldNorm(doc=174)
      0.14285715 = coord(2/14)
    
    Abstract
In this paper, we explore an approach to making better use of semi-structured documents in information retrieval in the domain of biology. Using machine learning techniques, we make those inherent structures explicit through XML markup. This markup has great potential to improve task performance in specimen identification and the usability of online floras and faunas.
  11. Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.00
    0.0046954313 = product of:
      0.032868017 = sum of:
        0.020446755 = weight(_text_:web in 3614) [ClassicSimilarity], result of:
          0.020446755 = score(doc=3614,freq=4.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.25496176 = fieldWeight in 3614, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
        0.012421262 = weight(_text_:retrieval in 3614) [ClassicSimilarity], result of:
          0.012421262 = score(doc=3614,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.16710453 = fieldWeight in 3614, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
      0.14285715 = coord(2/14)
    
    Abstract
Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification system for browsing. The classification algorithm was evaluated by the users, who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success in browsing proved to be correlated with, and dependent on, classification correctness. Research limitations/implications - Further research should address the problem of disparate evaluations of one and the same web page. The reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from the start; allowing words from class captions to be searched with synonym expansion (easily provided for Ei, since the classes are mapped to thesaurus terms); and, when a class caption matches a search term, returning the hierarchical tree expanded around that class. The need for improvements to classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
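    One of the suggested improvements, returning the hierarchy expanded around a class whose caption matches the query, can be illustrated on a toy hierarchy (the classes below are invented, not Ei classes):

    ```python
    # Toy scheme: class -> (caption, parent); invented for illustration
    scheme = {
        "6":     ("Engineering", None),
        "641":   ("Heat and thermodynamics", "6"),
        "641.2": ("Heat transfer", "641"),
        "921":   ("Mathematics", "6"),
    }
    children = {}
    for cls, (_, parent) in scheme.items():
        children.setdefault(parent, []).append(cls)

    def expand_around(term):
        """Yield each class whose caption contains the term, together with its
        ancestors and immediate subclasses (the tree 'expanded around' it)."""
        for cls, (caption, parent) in scheme.items():
            if term.lower() in caption.lower():
                path, node = [], cls
                while node is not None:
                    path.append(node)
                    node = scheme[node][1]
                yield {"class": cls, "ancestors": path[:0:-1],
                       "subclasses": children.get(cls, [])}

    print(list(expand_around("heat")))
    ```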
    Theme
    Klassifikationssysteme im Online-Retrieval
  12. Pfister, J.: Clustering von Patent-Dokumenten am Beispiel der Datenbanken des Fachinformationszentrums Karlsruhe (2006) 0.00
    0.003795353 = product of:
      0.02656747 = sum of:
        0.006693451 = weight(_text_:information in 5976) [ClassicSimilarity], result of:
          0.006693451 = score(doc=5976,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.1551638 = fieldWeight in 5976, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0625 = fieldNorm(doc=5976)
        0.01987402 = weight(_text_:retrieval in 5976) [ClassicSimilarity], result of:
          0.01987402 = score(doc=5976,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.26736724 = fieldWeight in 5976, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0625 = fieldNorm(doc=5976)
      0.14285715 = coord(2/14)
    
    Source
Effektive Information Retrieval Verfahren in Theorie und Praxis: ausgewählte und erweiterte Beiträge des Vierten Hildesheimer Evaluierungs- und Retrievalworkshop (HIER 2005), Hildesheim, 20.7.2005. Ed.: T. Mandl and C. Womser-Hacker
  13. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.00
    0.003766141 = product of:
      0.026362985 = sum of:
        0.020446755 = weight(_text_:web in 831) [ClassicSimilarity], result of:
          0.020446755 = score(doc=831,freq=4.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.25496176 = fieldWeight in 831, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
        0.005916231 = weight(_text_:information in 831) [ClassicSimilarity], result of:
          0.005916231 = score(doc=831,freq=4.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.13714671 = fieldWeight in 831, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
      0.14285715 = coord(2/14)
    
    Abstract
Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language-modeling-based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy models, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text, and a segmentation-based approach was compared with a non-segmentation-based approach. Findings - There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word-level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually it stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
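    The language-modeling approach can work directly on characters, so no word segmentation is required; a minimal sketch with per-class character n-gram models and add-one smoothing (the paper's exact model and smoothing are not specified in the abstract):

    ```python
    from collections import Counter
    from math import log

    class CharNgramLM:
        """Per-class character n-gram classifier; needs no word boundaries."""
        def __init__(self, n=3):
            self.n, self.models, self.totals = n, {}, {}

        def _grams(self, text):
            return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

        def fit(self, texts, labels):
            for text, y in zip(texts, labels):
                self.models.setdefault(y, Counter()).update(self._grams(text))
            self.totals = {y: sum(c.values()) for y, c in self.models.items()}
            self.vocab_size = len({g for c in self.models.values() for g in c}) + 1
            return self

        def predict(self, text):
            # class maximizing the add-one-smoothed log-likelihood of the n-grams
            return max(self.models, key=lambda y: sum(
                log((self.models[y][g] + 1) / (self.totals[y] + self.vocab_size))
                for g in self._grams(text)))

    clf = CharNgramLM().fit(["今日の天気は晴れ", "株価が上昇した"], ["weather", "finance"])
    print(clf.predict("明日の天気"))   # -> weather
    ```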
  14. Yoon, Y.; Lee, G.G.: Efficient implementation of associative classifiers for document classification (2007) 0.00
    0.0033715093 = product of:
      0.023600563 = sum of:
        0.008695048 = weight(_text_:information in 909) [ClassicSimilarity], result of:
          0.008695048 = score(doc=909,freq=6.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.20156369 = fieldWeight in 909, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=909)
        0.014905514 = weight(_text_:retrieval in 909) [ClassicSimilarity], result of:
          0.014905514 = score(doc=909,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.20052543 = fieldWeight in 909, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=909)
      0.14285715 = coord(2/14)
    
    Abstract
In practical text classification tasks, the ability to interpret the classification result is as important as the ability to classify exactly. Associative classifiers have many favorable characteristics such as rapid training, good classification accuracy, and excellent interpretability. However, associative classifiers also have some obstacles to overcome when they are applied in the area of text classification. The target text collection generally has very high dimensionality, so the training process might take a very long time. We propose a feature selection based on the mutual information between the word and class variables to reduce the space dimension of the associative classifiers. In addition, the training process of the associative classifier produces a huge number of classification rules, which makes prediction for a new document inefficient. We resolve this by introducing a new efficient method for storing and pruning classification rules. This method can also be used when predicting a test document. Experimental results using the 20-newsgroups dataset show many benefits of associative classification in both training and prediction when applied to a real-world problem.
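    The proposed feature selection scores each word by its mutual information with the class variable, computed from a 2x2 table of document counts; a sketch of the standard formula (the notation is ours, not the paper's):

    ```python
    from math import log

    def mutual_information(n11, n10, n01, n00):
        """MI between word and class indicators. n11: docs containing the word
        and in the class; n10: containing the word, outside the class;
        n01/n00: likewise for docs without the word."""
        N = n11 + n10 + n01 + n00
        cells = ((n11, n11 + n10, n11 + n01), (n10, n11 + n10, n10 + n00),
                 (n01, n01 + n00, n11 + n01), (n00, n01 + n00, n10 + n00))
        return sum(n / N * log(n * N / (row * col), 2)
                   for n, row, col in cells if n)

    # Rank candidate words by MI and keep the top-k as features
    print(mutual_information(n11=30, n10=70, n01=20, n00=880))
    ```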
    Footnote
    Beitrag in: Special issue on AIRS2005: Information Retrieval Research in Asia
    Source
    Information processing and management. 43(2007) no.2, S.393-405
  15. Ribeiro-Neto, B.; Laender, A.H.F.; Lima, L.R.S. de: ¬An experimental study in automatically categorizing medical documents (2001) 0.00
    0.0033546495 = product of:
      0.023482546 = sum of:
        0.005916231 = weight(_text_:information in 5702) [ClassicSimilarity], result of:
          0.005916231 = score(doc=5702,freq=4.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.13714671 = fieldWeight in 5702, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5702)
        0.017566316 = weight(_text_:retrieval in 5702) [ClassicSimilarity], result of:
          0.017566316 = score(doc=5702,freq=4.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.23632148 = fieldWeight in 5702, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5702)
      0.14285715 = coord(2/14)
    
    Abstract
In this article, we evaluate the retrieval performance of an algorithm that automatically categorizes medical documents. The categorization, which consists in assigning an International Classification of Diseases (ICD) code to the medical document under examination, is based on well-known information retrieval techniques. The algorithm, which we proposed, operates in a fully automatic mode and requires no supervision or training data. Using a database of 20,569 documents, we verify that the algorithm attains levels of average precision in the 70-80% range for category coding and in the 60-70% range for subcategory coding. We also carefully analyze the cases of those documents whose categorization is not in accordance with the one provided by the human specialists. The vast majority of them represent cases that can only be fully categorized with the assistance of a human subject (because, for instance, they require specific knowledge of a given pathology). For a slim fraction of all documents (0.77% for category coding and 1.4% for subcategory coding), the algorithm makes assignments that are clearly incorrect. However, this fraction corresponds to only one-fourth of the mistakes made by the human specialists.
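    The abstract names no specific technique beyond "well-known information retrieval techniques"; one training-free reading is to rank ICD category descriptions by vector-space similarity to the document, as in this sketch (the category descriptions are invented):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical ICD category descriptions; a real system would load the scheme
    icd = {
        "J18": "pneumonia unspecified organism lung infection",
        "I21": "acute myocardial infarction heart attack coronary occlusion",
        "E11": "type 2 diabetes mellitus insulin glucose",
    }
    vectorizer = TfidfVectorizer()
    category_matrix = vectorizer.fit_transform(icd.values())

    def assign_icd(document_text):
        """Rank categories by similarity to the document; no training data needed."""
        sims = cosine_similarity(vectorizer.transform([document_text]), category_matrix)
        return list(icd)[sims[0].argmax()]

    print(assign_icd("admitted with chest pain; acute myocardial infarction suspected"))
    ```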
    Source
Journal of the American Society for Information Science and Technology. 52(2001) no.5, S.391-401
16. Godby, C.J.; Stuler, J.: The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.00
    0.003320934 = product of:
      0.023246538 = sum of:
        0.00585677 = weight(_text_:information in 3962) [ClassicSimilarity], result of:
          0.00585677 = score(doc=3962,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.13576832 = fieldWeight in 3962, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3962)
        0.017389767 = weight(_text_:retrieval in 3962) [ClassicSimilarity], result of:
          0.017389767 = score(doc=3962,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.23394634 = fieldWeight in 3962, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3962)
      0.14285715 = coord(2/14)
    
    Source
    Subject retrieval in a networked environment: Proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. Ed.: I.C. McIlwaine
  17. Sebastiani, F.: Classification of text, automatic (2006) 0.00
    0.003320934 = product of:
      0.023246538 = sum of:
        0.00585677 = weight(_text_:information in 5003) [ClassicSimilarity], result of:
          0.00585677 = score(doc=5003,freq=2.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.13576832 = fieldWeight in 5003, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5003)
        0.017389767 = weight(_text_:retrieval in 5003) [ClassicSimilarity], result of:
          0.017389767 = score(doc=5003,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.23394634 = fieldWeight in 5003, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5003)
      0.14285715 = coord(2/14)
    
    Abstract
    Automatic text classification (ATC) is a discipline at the crossroads of information retrieval (IR), machine learning (ML), and computational linguistics (CL), and consists in the realization of text classifiers, i.e. software systems capable of assigning texts to one or more categories, or classes, from a predefined set. Applications range from the automated indexing of scientific articles, to e-mail routing, spam filtering, authorship attribution, and automated survey coding. This article will focus on the ML approach to ATC, whereby a software system (called the learner) automatically builds a classifier for the categories of interest by generalizing from a "training" set of pre-classified texts.
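    A minimal instance of the ML approach sketched here: a learner generalizes a classifier from a small pre-classified training set (the toy data and the choice of a linear SVM are illustrative assumptions; any of the article's learner families would do):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_texts = ["cheap pills buy now", "meeting agenda attached",
                   "win a free prize today", "project deadline next week"]
    train_labels = ["spam", "legitimate", "spam", "legitimate"]

    # The 'learner' builds the classifier by generalizing from the training set
    classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    classifier.fit(train_texts, train_labels)
    print(classifier.predict(["free pills, win a prize"]))   # -> ['spam']
    ```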
18. Chung, Y.M.; Lee, J.Y.: A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.00
    0.0029697255 = product of:
      0.020788077 = sum of:
        0.008366814 = weight(_text_:information in 5769) [ClassicSimilarity], result of:
          0.008366814 = score(doc=5769,freq=8.0), product of:
            0.04313797 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.024573348 = queryNorm
            0.19395474 = fieldWeight in 5769, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
        0.012421262 = weight(_text_:retrieval in 5769) [ClassicSimilarity], result of:
          0.012421262 = score(doc=5769,freq=2.0), product of:
            0.07433229 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.024573348 = queryNorm
            0.16710453 = fieldWeight in 5769, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.14285715 = coord(2/14)
    
    Abstract
Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationships and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-scores. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas the cosine and Jaccard coefficients, as well as the X² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X² statistic is the least affected by the frequency of terms. Third, although the cosine and Jaccard coefficients tend to emphasize high-frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
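    All six measures can be computed from the 2x2 contingency table of a term pair; a sketch using the standard formulas (the paper's exact variants, e.g. of mutual information, may differ):

    ```python
    from math import log, sqrt

    def association_measures(a, b, c, d):
        """a: docs with both terms, b: term 1 only, c: term 2 only, d: neither."""
        N = a + b + c + d
        expected = [(a + b) * (a + c) / N, (a + b) * (b + d) / N,
                    (c + d) * (a + c) / N, (c + d) * (b + d) / N]
        return {
            "cosine": a / sqrt((a + b) * (a + c)),
            "jaccard": a / (a + b + c),
            "mutual_information": log(a * N / ((a + b) * (a + c)), 2),
            "yule_Y": (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c)),
            "chi_square": N * (a * d - b * c) ** 2
                          / ((a + b) * (c + d) * (a + c) * (b + d)),
            "log_likelihood": 2 * sum(o * log(o / e) for o, e
                                      in zip((a, b, c, d), expected) if o),
        }

    print(association_measures(a=30, b=10, c=20, d=940))
    ```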
    Source
Journal of the American Society for Information Science and Technology. 52(2001) no.4, S.283-296
  19. Fong, A.C.M.: Mining a Web citation database for document clustering (2002) 0.00
    0.002891608 = product of:
      0.04048251 = sum of:
        0.04048251 = weight(_text_:web in 3940) [ClassicSimilarity], result of:
          0.04048251 = score(doc=3940,freq=2.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.50479853 = fieldWeight in 3940, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.109375 = fieldNorm(doc=3940)
      0.071428575 = coord(1/14)
    
20. Lindholm, J.; Schönthal, T.; Jansson, K.: Experiences of harvesting Web resources in engineering using automatic classification (2003) 0.00
    0.0028619498 = product of:
      0.040067296 = sum of:
        0.040067296 = weight(_text_:web in 4088) [ClassicSimilarity], result of:
          0.040067296 = score(doc=4088,freq=6.0), product of:
            0.08019538 = queryWeight, product of:
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.024573348 = queryNorm
            0.49962097 = fieldWeight in 4088, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.2635105 = idf(docFreq=4597, maxDocs=44218)
              0.0625 = fieldNorm(doc=4088)
      0.071428575 = coord(1/14)
    
    Abstract
The authors describe the background and the work involved in setting up Engine-e, a Web index that uses automatic classification as a means of selecting resources in engineering. Considerations in offering a robot-generated Web index as a successor to a manually indexed, quality-controlled subject gateway are also discussed.

Languages

  • e 71
  • d 13
  • a 1

Types

  • a 73
  • el 9
  • x 4
  • m 2
  • s 1