Search (28 results, page 2 of 2)

  • language_ss:"e"
  • theme_ss:"Automatisches Klassifizieren"
  1. Golub, K.: Automated subject classification of textual web documents (2006) 0.00
    0.0044974573 = product of:
      0.01798983 = sum of:
        0.01798983 = product of:
          0.03597966 = sum of:
            0.03597966 = weight(_text_:design in 5600) [ClassicSimilarity], result of:
              0.03597966 = score(doc=5600,freq=2.0), product of:
                0.17322445 = queryWeight, product of:
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.046071928 = queryNorm
                0.20770542 = fieldWeight in 5600, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5600)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
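    The explain tree above is Lucene's ClassicSimilarity (TF-IDF) scoring breakdown for the query term "design". As a minimal sketch of how its numbers combine - assuming Lucene's documented ClassicSimilarity formulas; the variable names below are illustrative, not part of the page - the weight can be reproduced in Python:

      import math

      # Constants copied from the explain tree for result 1 (doc 5600).
      freq       = 2.0           # termFreq of "design" in the field
      doc_freq   = 2798          # documents containing "design"
      max_docs   = 44218         # documents in the index
      query_norm = 0.046071928   # query normalization constant
      field_norm = 0.0390625     # encoded field-length norm for doc 5600

      tf  = math.sqrt(freq)                            # 1.4142135
      idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # 3.7598698

      query_weight = idf * query_norm                  # 0.17322445 = queryWeight
      field_weight = tf * idf * field_norm             # 0.20770542 = fieldWeight
      raw_weight   = query_weight * field_weight       # 0.03597966 = weight(_text_:design)

      # coord() down-weights documents matching only some query clauses.
      score = raw_weight * 0.5 * 0.25                  # coord(1/2) * coord(1/4)
      print(score)                                     # ~0.0044974573

    The same constants appear verbatim in the trees for results 2-5, which is why all five entries share the score 0.0044974573: each matches "design" twice in a field of the same normalized length.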
    
    Abstract
    Purpose - To provide an integrated perspective on the similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and to point to problems with the approaches and with automated classification as such. Design/methodology/approach - A range of works dealing with automated classification of full-text web documents is discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings - Identifies the major similarities and differences between the three approaches: document pre-processing and the exploitation of web-specific document characteristics are common to all of them; the major differences lie in the algorithms applied and in whether the vector space model and controlled vocabularies are employed. Problems of automated classification are recognized. Research limitations/implications - The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications - As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners; it also gives researchers from one community information on how similar tasks are conducted in other communities. Originality/value - To the author's knowledge, no review paper on automated text classification has attempted to discuss more than one community's approach from an integrated perspective.
  2. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.00
    0.0044974573 = product of:
      0.01798983 = sum of:
        0.01798983 = product of:
          0.03597966 = sum of:
            0.03597966 = weight(_text_:design in 831) [ClassicSimilarity], result of:
              0.03597966 = score(doc=831,freq=2.0), product of:
                0.17322445 = queryWeight, product of:
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.046071928 = queryNorm
                0.20770542 = fieldWeight in 831, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=831)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Abstract
    Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian-language text classification, for languages such as Chinese and Japanese where written text carries no word-boundary information. The paper advocates a simple language-modeling-based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy, support vector machine, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To assess the influence of word segmentation, several segmentation approaches were applied to Chinese text, and a segmentation-based approach was compared with a non-segmentation-based one. Findings - There were two main findings: the experiments show that statistical language modeling can significantly outperform the standard techniques given the same set of features; and classification with word-level features normally improves performance, but the improvement is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increasing segmentation accuracy, but beyond a certain level it stops improving and can in fact decrease. Practical implications - Applying the findings to real-world web text classification is ongoing work. Originality/value - The paper is highly relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
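    The abstract does not spell out the authors' language-modeling classifier, but the general idea - one language model per class, with a document assigned to the class under whose model it is most probable - can be sketched with character n-grams, which need no word segmentation at all. The class labels, n-gram order, and add-one smoothing below are illustrative assumptions, not the paper's setup:

      import math
      from collections import defaultdict

      def ngrams(text, n):
          # Character n-grams: no word boundaries required, which suits
          # Chinese and Japanese text.
          return [text[i:i + n] for i in range(len(text) - n + 1)]

      class NGramLMClassifier:
          def __init__(self, n=2):
              self.n = n
              self.counts = defaultdict(lambda: defaultdict(int))
              self.totals = defaultdict(int)
              self.vocab = set()

          def fit(self, docs, labels):
              for text, label in zip(docs, labels):
                  for g in ngrams(text, self.n):
                      self.counts[label][g] += 1
                      self.totals[label] += 1
                      self.vocab.add(g)

          def log_prob(self, text, label):
              # Add-one smoothing; real LM classifiers use stronger
              # smoothing schemes.
              v = len(self.vocab) + 1
              return sum(math.log((self.counts[label][g] + 1) /
                                  (self.totals[label] + v))
                         for g in ngrams(text, self.n))

          def predict(self, text):
              return max(self.counts,
                         key=lambda label: self.log_prob(text, label))

      clf = NGramLMClassifier(n=2)
      clf.fit(["棒球比賽今天舉行", "股市今日大漲"], ["sports", "finance"])
      print(clf.predict("比賽結果"))   # -> "sports"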
  3. Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C.: A comparative study of two automatic document classification methods in a library setting (2008) 0.00
    0.0044974573 = product of:
      0.01798983 = sum of:
        0.01798983 = product of:
          0.03597966 = sum of:
            0.03597966 = weight(_text_:design in 2532) [ClassicSimilarity], result of:
              0.03597966 = score(doc=2532,freq=2.0), product of:
                0.17322445 = queryWeight, product of:
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.046071928 = queryNorm
                0.20770542 = fieldWeight in 2532, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2532)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Abstract
    In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using a manual approach alone. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years, yet applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine-learning-based automatic document classification system to alleviate the manual categorization problem encountered in the library setting. Two supervised machine learning algorithms were tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system that enhances current library practice. Moreover, concrete recommendations are made on how to apply the KNN algorithm in practice to develop automatic document classification in a library setting. To the best of our knowledge, this is the first in-depth study applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
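    The abstract gives no implementation details, but a baseline of the kind it describes - KNN over TF-IDF document vectors - can be sketched as below. The library choice (scikit-learn), the toy training data, the LCC class letters, and the parameter values are assumptions for illustration, not the authors' configuration:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.pipeline import make_pipeline

      # Hypothetical catalogue records labelled with LCC top-level classes.
      train_texts = [
          "introduction to algebraic topology and homology theory",
          "copyright law and licensing in the digital age",
          "statistical machine learning for text categorization",
      ]
      train_labels = ["QA", "KF", "QA"]

      # TF-IDF features plus k-nearest-neighbour classification; cosine
      # distance is the usual choice for sparse text vectors.
      clf = make_pipeline(
          TfidfVectorizer(sublinear_tf=True, stop_words="english"),
          KNeighborsClassifier(n_neighbors=1, metric="cosine"),
      )
      clf.fit(train_texts, train_labels)

      print(clf.predict(["a primer on statistical learning theory"]))  # -> ['QA']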
  4. Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.00
    0.0044974573 = product of:
      0.01798983 = sum of:
        0.01798983 = product of:
          0.03597966 = sum of:
            0.03597966 = weight(_text_:design in 3614) [ClassicSimilarity], result of:
              0.03597966 = score(doc=3614,freq=2.0), product of:
                0.17322445 = queryWeight, product of:
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.046071928 = queryNorm
                0.20770542 = fieldWeight in 3614, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3614)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Abstract
    Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the scheme's suitability for browsing. The classification algorithm was evaluated by the users, who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited to browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Browsing success was shown to be correlated with, and dependent on, classification correctness. Research limitations/implications - Further research should address the problem of disparate evaluations of one and the same web page. The reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Several improvements for browsing were identified: describing class captions and/or listing their subclasses from the start; allowing words from class captions to be searched with synonym support (easily provided for Ei, since the classes are mapped to thesaurus terms); and, when class captions are searched, returning the hierarchical tree expanded around the class whose caption contains the search term. The need for improvements to classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
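    One of the suggested improvements - returning the hierarchical tree expanded around the class whose caption contains the search term - is straightforward to sketch. The toy hierarchy and captions below are invented for illustration and are not Ei's actual classes:

      # Class code -> (caption, parent code); None marks a root class.
      HIERARCHY = {
          "400":   ("Civil engineering", None),
          "401":   ("Surveying", "400"),
          "401.1": ("Photogrammetry", "401"),
          "402":   ("Buildings and towers", "400"),
      }

      def path_to_root(code):
          # Chain of classes from the root down to the given class,
          # i.e. the branch to expand in the browsing tree.
          chain = []
          while code is not None:
              chain.append(code)
              code = HIERARCHY[code][1]
          return list(reversed(chain))

      def expand_around_matches(term):
          term = term.lower()
          return [path_to_root(code)
                  for code, (caption, _) in HIERARCHY.items()
                  if term in caption.lower()]

      print(expand_around_matches("surveying"))   # -> [['400', '401']]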
  5. Salles, T.; Rocha, L.; Gonçalves, M.A.; Almeida, J.M.; Mourão, F.; Meira Jr., W.; Viegas, F.: A quantitative analysis of the temporal effects on automatic text classification (2016) 0.00
    0.0044974573 = product of:
      0.01798983 = sum of:
        0.01798983 = product of:
          0.03597966 = sum of:
            0.03597966 = weight(_text_:design in 3014) [ClassicSimilarity], result of:
              0.03597966 = score(doc=3014,freq=2.0), product of:
                0.17322445 = queryWeight, product of:
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.046071928 = queryNorm
                0.20770542 = fieldWeight in 3014, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.7598698 = idf(docFreq=2798, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3014)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Abstract
    Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.
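    The first kind of temporal effect the abstract reports - variation of the class distribution over time - can be made concrete with a small sketch. The toy data, the yearly buckets, and the choice of Jensen-Shannon divergence as the drift measure are illustrative assumptions; the article's own analysis is more elaborate:

      import math
      from collections import Counter

      def class_distribution(labels):
          counts = Counter(labels)
          total = sum(counts.values())
          return {cls: n / total for cls, n in counts.items()}

      def js_divergence(p, q):
          # Jensen-Shannon divergence between two {class: prob} dicts;
          # 0 means identical distributions.
          keys = set(p) | set(q)
          m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
          def kl(a, b):
              return sum(a[k] * math.log(a[k] / b[k])
                         for k in keys if a.get(k, 0.0) > 0)
          return 0.5 * kl(p, m) + 0.5 * kl(q, m)

      # Hypothetical labelled documents bucketed by year.
      by_year = {
          2004: ["sports", "sports", "politics", "tech"],
          2008: ["tech", "tech", "tech", "politics"],
      }
      p, q = (class_distribution(v) for v in by_year.values())
      print(js_divergence(p, q))   # > 0: the class distribution drifted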
  6. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.00
    0.003901319 = product of:
      0.015605276 = sum of:
        0.015605276 = product of:
          0.031210553 = sum of:
            0.031210553 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
              0.031210553 = score(doc=2765,freq=2.0), product of:
                0.16133605 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046071928 = queryNorm
                0.19345059 = fieldWeight in 2765, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2765)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Date
    22. 3.2009 19:14:43
  7. Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.00
    0.003901319 = product of:
      0.015605276 = sum of:
        0.015605276 = product of:
          0.031210553 = sum of:
            0.031210553 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
              0.031210553 = score(doc=1107,freq=2.0), product of:
                0.16133605 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046071928 = queryNorm
                0.19345059 = fieldWeight in 1107, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1107)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Date
    28.10.2013 19:22:57
  8. Khoo, C.S.G.; Ng, K.; Ou, S.: ¬An exploratory study of human clustering of Web pages (2003) 0.00
    0.0031210552 = product of:
      0.012484221 = sum of:
        0.012484221 = product of:
          0.024968442 = sum of:
            0.024968442 = weight(_text_:22 in 2741) [ClassicSimilarity], result of:
              0.024968442 = score(doc=2741,freq=2.0), product of:
                0.16133605 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046071928 = queryNorm
                0.15476047 = fieldWeight in 2741, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03125 = fieldNorm(doc=2741)
          0.5 = coord(1/2)
      0.25 = coord(1/4)
    
    Date
    12. 9.2004 9:56:22