Search (137 results, page 1 of 7)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.12

0.11885393 = product of:
  0.35656178 = sum of:
    0.034793857 = product of:
      0.10438157 = sum of:
        0.10438157 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.10438157 = score(doc=562,freq=2.0), product of:
            0.18572637 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.021906832 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.10438157 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.10438157 = score(doc=562,freq=2.0), product of:
        0.18572637 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.021906832 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.10438157 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.10438157 = score(doc=562,freq=2.0), product of:
        0.18572637 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.021906832 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.10438157 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.10438157 = score(doc=562,freq=2.0), product of:
        0.18572637 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.021906832 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.0026870528 = weight(_text_:in in 562) [ClassicSimilarity], result of:
      0.0026870528 = score(doc=562,freq=2.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.09017298 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.0059361467 = product of:
      0.01780844 = sum of:
        0.01780844 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.01780844 = score(doc=562,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
  0.33333334 = coord(6/18)

Abstract: Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well-known text corpora support our approach through consistent improvement of the results.
Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32
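
The per-term lines in the score breakdown for this hit can be checked by hand. The following is a minimal Python sketch, assuming tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is what the printed values are consistent with. The function names are illustrative only and are not part of any Lucene API; queryNorm and fieldNorm are taken from the listing as given rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term frequency factor, assumed form: sqrt(freq)."""
    return math.sqrt(freq)


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    """Score of one term in one document: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                     # idf * queryNorm
    field_weight = classic_tf(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# Values copied from the listing for the term "3a" in doc 562.
score_3a = term_score(freq=2.0, doc_freq=24, max_docs=44218,
                      query_norm=0.021906832, field_norm=0.046875)
print(round(score_3a, 6))
```

With these inputs the result should land very close to the 0.10438157 shown for `_text_:3a`. The same function applied to the frequent term "in" (docFreq=30841) gives a far smaller score, because its idf is only about 1.36.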

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.00

0.001596889 = product of:
  0.0143720005 = sum of:
    0.0044784215 = weight(_text_:in in 2748) [ClassicSimilarity], result of:
      0.0044784215 = score(doc=2748,freq=2.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.15028831 = fieldWeight in 2748, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.078125 = fieldNorm(doc=2748)
    0.0098935785 = product of:
      0.029680735 = sum of:
        0.029680735 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.029680735 = score(doc=2748,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Date: 1. 2.2016 18:25:22
Series: Lecture notes in computer science ; 9398
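
The outer "product of" lines in each breakdown combine clause scores with a coordination factor: the matching clause scores are summed and then scaled by the fraction of query clauses that matched. A small Python sketch of that step for doc 2748 follows, assuming coord(m/n) simply means m divided by n. The helper name is hypothetical, and the clause scores are copied from the listing.

```python
def coord_product(clause_scores, matched, total):
    """Sum the matching clause scores and scale by coord(matched/total)."""
    return sum(clause_scores) * (matched / total)


# Inner group: only the term "22" matched, out of 3 clauses -> coord(1/3).
inner = coord_product([0.029680735], matched=1, total=3)

# Outer query: the term "in" plus the inner group matched, out of 18 clauses -> coord(2/18).
total_score = coord_product([0.0044784215, inner], matched=2, total=18)

print(round(inner, 7), round(total_score, 7))
```

This is why hits that match only 2 of 18 clauses are displayed with a rounded score of 0.00, while the first hit, which matches 6 of 18 including rare terms, scores about 0.12.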

Yoon, Y.; Lee, C.; Lee, G.G.: ¬An effective procedure for constructing a hierarchical text classification system (2006) 0.00

0.0014661439 = product of:
  0.013195295 = sum of:
    0.0062697898 = weight(_text_:in in 5273) [ClassicSimilarity], result of:
      0.0062697898 = score(doc=5273,freq=8.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.21040362 = fieldWeight in 5273, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
    0.0069255047 = product of:
      0.020776514 = sum of:
        0.020776514 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
          0.020776514 = score(doc=5273,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.2708308 = fieldWeight in 5273, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5273)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: In text categorization tasks, classification on some class hierarchies has better results than in cases without the hierarchy. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
Date: 22. 7.2006 16:24:52

Savic, D.: Designing an expert system for classifying office documents (1994) 0.00

0.0014503849 = product of:
  0.013053464 = sum of:
    0.0050667557 = weight(_text_:in in 2655) [ClassicSimilarity], result of:
      0.0050667557 = score(doc=2655,freq=4.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.17003182 = fieldWeight in 2655, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0625 = fieldNorm(doc=2655)
    0.007986708 = product of:
      0.023960123 = sum of:
        0.023960123 = weight(_text_:29 in 2655) [ClassicSimilarity], result of:
          0.023960123 = score(doc=2655,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.31092256 = fieldWeight in 2655, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0625 = fieldNorm(doc=2655)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: Can records management benefit from artificial intelligence technology, in particular from expert systems? Gives an answer to this question by showing an example of a small scale prototype project in automatic classification of office documents. Project methodology and basic elements of an expert system's approach are elaborated to give guidelines to potential users of this promising technology.
Source: Records management quarterly. 28(1994) no.3, pp.20-29

Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.00
```
0.0013968822 = product of:
  0.01257194 = sum of:
    0.0065819086 = weight(_text_:in in 1566) [ClassicSimilarity], result of:
      0.0065819086 = score(doc=1566,freq=12.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.22087781 = fieldWeight in 1566, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
    0.005990031 = product of:
      0.017970093 = sum of:
        0.017970093 = weight(_text_:29 in 1566) [ClassicSimilarity], result of:
          0.017970093 = score(doc=1566,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.23319192 = fieldWeight in 1566, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract: This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.
Source: Journal of information science. 29(2003) no.2, pp.117-126

Drori, O.; Alon, N.: Using document classification for displaying search results (2003) 0.00

0.0013331628 = product of:
  0.011998464 = sum of:
    0.006008433 = weight(_text_:in in 1565) [ClassicSimilarity], result of:
      0.006008433 = score(doc=1565,freq=10.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.20163295 = fieldWeight in 1565, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=1565)
    0.005990031 = product of:
      0.017970093 = sum of:
        0.017970093 = weight(_text_:29 in 1565) [ClassicSimilarity], result of:
          0.017970093 = score(doc=1565,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.23319192 = fieldWeight in 1565, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.046875 = fieldNorm(doc=1565)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: In this paper, four self-developed user interfaces that display document search results using different methods were compared. To create the four interfaces, two information elements were used: document categories and lines from the document. A user study compared the four interfaces. It was found that the category addition to the interface was beneficial in both measurable and subjective measures. It was also found that displaying the relevant lines from the document increased the effectiveness and shortened the search time in all cases and tasks. The participants preferred the interface containing categories and relevant lines to all other interfaces checked. It was also the fastest in the objective time measurement. A further sub-study showed that the most important parameter for the users was the confidence level that the answer was accurate, and the least important parameter was the feeling of comfort while conducting a search.
Source: Journal of information science. 29(2003) no.2, pp.97-106

Ruocco, A.S.; Frieder, O.: Clustering and classification of large document bases in a parallel environment (1997) 0.00

0.0012690867 = product of:
  0.01142178 = sum of:
    0.004433411 = weight(_text_:in in 1661) [ClassicSimilarity], result of:
      0.004433411 = score(doc=1661,freq=4.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.14877784 = fieldWeight in 1661, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1661)
    0.006988369 = product of:
      0.020965107 = sum of:
        0.020965107 = weight(_text_:29 in 1661) [ClassicSimilarity], result of:
          0.020965107 = score(doc=1661,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.27205724 = fieldWeight in 1661, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1661)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: Proposes the use of parallel computing systems to overcome the computationally intense clustering process. Examines 2 operations: clustering a document set and classifying the document set. Uses a subset of the TIPSTER corpus, specifically, articles from the Wall Street Journal. Document set classification was performed without the large storage requirements for ancillary data matrices. The parallel systems ran faster than the sequential systems and produced the same clustering and classification scheme. Results show near-linear speedup in higher-threshold clustering applications.
Date: 29. 7.1998 17:45:02

Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.00
```
0.0012626819 = product of:
  0.011364137 = sum of:
    0.0053741056 = weight(_text_:in in 3464) [ClassicSimilarity], result of:
      0.0053741056 = score(doc=3464,freq=8.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.18034597 = fieldWeight in 3464, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=3464)
    0.005990031 = product of:
      0.017970093 = sum of:
        0.017970093 = weight(_text_:29 in 3464) [ClassicSimilarity], result of:
          0.017970093 = score(doc=3464,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.23319192 = fieldWeight in 3464, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.046875 = fieldNorm(doc=3464)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract: We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and they are employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods, and they were cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
Date: 1. 6.2010 9:29:57

Li, T.; Zhu, S.; Ogihara, M.: Hierarchical document classification using automatically generated hierarchy (2007) 0.00
```
0.0012626819 = product of:
  0.011364137 = sum of:
    0.0053741056 = weight(_text_:in in 4797) [ClassicSimilarity], result of:
      0.0053741056 = score(doc=4797,freq=8.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.18034597 = fieldWeight in 4797, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=4797)
    0.005990031 = product of:
      0.017970093 = sum of:
        0.017970093 = weight(_text_:29 in 4797) [ClassicSimilarity], result of:
          0.017970093 = score(doc=4797,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.23319192 = fieldWeight in 4797, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.046875 = fieldNorm(doc=4797)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract: Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
Source: Journal of intelligent information systems. 29(2007) no.2, pp.211-230

Dubin, D.: Dimensions and discriminability (1998) 0.00

0.0012621018 = product of:
  0.011358916 = sum of:
    0.004433411 = weight(_text_:in in 2338) [ClassicSimilarity], result of:
      0.004433411 = score(doc=2338,freq=4.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.14877784 = fieldWeight in 2338, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2338)
    0.0069255047 = product of:
      0.020776514 = sum of:
        0.020776514 = weight(_text_:22 in 2338) [ClassicSimilarity], result of:
          0.020776514 = score(doc=2338,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.2708308 = fieldWeight in 2338, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2338)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: Visualization interfaces can improve subject access by highlighting the inclusion of document representation components in similarity and discrimination relationships. Within a set of retrieved documents, what kinds of groupings can index terms and subject headings make explicit? The role of controlled vocabulary in classifying search output is examined.
Date: 22. 9.1997 19:16:05

Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.00

0.0012621018 = product of:
  0.011358916 = sum of:
    0.004433411 = weight(_text_:in in 1673) [ClassicSimilarity], result of:
      0.004433411 = score(doc=1673,freq=4.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.14877784 = fieldWeight in 1673, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1673)
    0.0069255047 = product of:
      0.020776514 = sum of:
        0.020776514 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
          0.020776514 = score(doc=1673,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.2708308 = fieldWeight in 1673, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1673)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
Date: 1. 8.1996 22:08:06

Golub, K.; Hansson, J.; Soergel, D.; Tudhope, D.: Managing classification in libraries : a methodological outline for evaluating automatic subject indexing and classification in Swedish library catalogues (2015) 0.00
```
0.0012583486 = product of:
  0.011325137 = sum of:
    0.0063334443 = weight(_text_:in in 2300) [ClassicSimilarity], result of:
      0.0063334443 = score(doc=2300,freq=16.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.21253976 = fieldWeight in 2300, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
    0.0049916925 = product of:
      0.0149750775 = sum of:
        0.0149750775 = weight(_text_:29 in 2300) [ClassicSimilarity], result of:
          0.0149750775 = score(doc=2300,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.19432661 = fieldWeight in 2300, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2300)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract: Subject terms play a crucial role in resource discovery but require substantial effort to produce. Automatic subject classification and indexing address problems of scale and sustainability and can be used to enrich existing bibliographic records, establish more connections across and between resources and enhance consistency of bibliographic data. The paper aims to put forward a complex methodological framework to evaluate automatic classification tools of Swedish textual documents based on the Dewey Decimal Classification (DDC) recently introduced to Swedish libraries. Three major complementary approaches are suggested: a quality-built gold standard, retrieval effects, domain analysis. The gold standard is built based on input from at least two catalogue librarians, end users expert in the subject, end users inexperienced in the subject and automated tools. Retrieval effects are studied through a combination of assigned and free tasks, including factual and comprehensive types. The study also takes into consideration the different role and character of subject terms in various knowledge domains, such as scientific disciplines. As a theoretical framework, domain analysis is used and applied in relation to the implementation of DDC in Swedish libraries and chosen domains of knowledge within the DDC itself.
Source: Classification and authority control: expanding resource discovery: proceedings of the International UDC Seminar 2015, 29-30 October 2015, Lisbon, Portugal. Eds.: Slavic, A. and M.I. Cordeiro

Ribeiro-Neto, B.; Laender, A.H.F.; Lima, L.R.S. de: ¬An experimental study in automatically categorizing medical documents (2001) 0.00
```
0.0012128986 = product of:
  0.010916088 = sum of:
    0.0059243953 = weight(_text_:in in 5702) [ClassicSimilarity], result of:
      0.0059243953 = score(doc=5702,freq=14.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.19881277 = fieldWeight in 5702, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5702)
    0.0049916925 = product of:
      0.0149750775 = sum of:
        0.0149750775 = weight(_text_:29 in 5702) [ClassicSimilarity], result of:
          0.0149750775 = score(doc=5702,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.19432661 = fieldWeight in 5702, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5702)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract: In this article, we evaluate the retrieval performance of an algorithm that automatically categorizes medical documents. The categorization, which consists in assigning an International Code of Disease (ICD) to the medical document under examination, is based on well-known information retrieval techniques. The algorithm, which we proposed, operates in a fully automatic mode and requires no supervision or training data. Using a database of 20,569 documents, we verify that the algorithm attains levels of average precision in the 70-80% range for category coding and in the 60-70% range for subcategory coding. We also carefully analyze the case of those documents whose categorization is not in accordance with the one provided by the human specialists. The vast majority of them represent cases that can only be fully categorized with the assistance of a human subject (because, for instance, they require specific knowledge of a given pathology). For a slim fraction of all documents (0.77% for category coding and 1.4% for subcategory coding), the algorithm makes assignments that are clearly incorrect. However, this fraction corresponds to only one-fourth of the mistakes made by the human specialists.
Date: 29. 9.2001 13:59:42

Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.00
```
0.0012128986 = product of:
  0.010916088 = sum of:
    0.0059243953 = weight(_text_:in in 1853) [ClassicSimilarity], result of:
      0.0059243953 = score(doc=1853,freq=14.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.19881277 = fieldWeight in 1853, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1853)
    0.0049916925 = product of:
      0.0149750775 = sum of:
        0.0149750775 = weight(_text_:29 in 1853) [ClassicSimilarity], result of:
          0.0149750775 = score(doc=1853,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.19432661 = fieldWeight in 1853, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
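The score tree above can be re-derived by hand. The following is a minimal sketch, not the search engine's own code: it assumes the classic TF-IDF form that the listed figures are consistent with (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), and it copies queryNorm, fieldNorm and the coord factors directly from the listing. The function names `idf` and `term_score` are hypothetical helpers introduced for illustration.

```python
import math

# Constants copied from the explanation tree for doc 1853.
MAX_DOCS = 44218
QUERY_NORM = 0.021906832
FIELD_NORM = 0.0390625


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency; assumed form, it matches the listed idf values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int) -> float:
    """Score of one term: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                        # tf(freq) in the listing
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = term_idf * QUERY_NORM        # queryWeight
    field_weight = tf * term_idf * FIELD_NORM   # fieldWeight
    return query_weight * field_weight


# Term "in": freq=14, docFreq=30841.
score_in = term_score(14.0, 30841)

# Term "29": freq=2, docFreq=3565, scaled by the inner coord(1/3).
score_29 = term_score(2.0, 3565) * (1 / 3)

# Outer coord(2/18): 2 of the 18 query clauses matched this document.
total = (score_in + score_29) * (2 / 18)

print(f"in: {score_in:.10f}  29: {score_29:.10f}  total: {total:.10f}")
```

The results should agree with the listed values (0.0059243953, 0.0049916925 and 0.0012128986) up to rounding, since the listing uses single-precision arithmetic while Python uses double precision.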
Abstract

In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics (bibliometrics and scientometrics studies) for STW rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) are already attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has rendered necessary the integration of natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted di-graphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), will seek to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.

Source

Knowledge organization. 29(2002) nos.3/4, S.181-197

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.00

0.0011766955 = product of:
  0.010590259 = sum of:
    0.0046541123 = weight(_text_:in in 2158) [ClassicSimilarity], result of:
      0.0046541123 = score(doc=2158,freq=6.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.1561842 = fieldWeight in 2158, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.0059361467 = product of:
      0.01780844 = sum of:
        0.01780844 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
          0.01780844 = score(doc=2158,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.23214069 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04

Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.00
```
0.0011590791 = product of:
  0.0104317125 = sum of:
    0.0054849237 = weight(_text_:in in 2765) [ClassicSimilarity], result of:
      0.0054849237 = score(doc=2765,freq=12.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.18406484 = fieldWeight in 2765, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.0049467892 = product of:
      0.014840367 = sum of:
        0.014840367 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
          0.014840367 = score(doc=2765,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.19345059 = fieldWeight in 2765, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks, a statistically significant difference (99% confidence). Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.

Date

22. 3.2009 19:14:43

Savic, D.: Automatic classification of office documents : review of available methods and techniques (1995) 0.00

0.0011248072 = product of:
  0.010123264 = sum of:
    0.0031348949 = weight(_text_:in in 2219) [ClassicSimilarity], result of:
      0.0031348949 = score(doc=2219,freq=2.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.10520181 = fieldWeight in 2219, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2219)
    0.006988369 = product of:
      0.020965107 = sum of:
        0.020965107 = weight(_text_:29 in 2219) [ClassicSimilarity], result of:
          0.020965107 = score(doc=2219,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.27205724 = fieldWeight in 2219, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2219)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: Classification of office documents is one of the administrative functions carried out by almost every organization and institution which sends and receives correspondence. Processing of this increasing amount of incoming and outgoing mail, in particular its classification, is time consuming and expensive. More and more organizations are seeking a solution for meeting this challenge by designing computer-based systems for automatic classification. Examines the present status of available knowledge and methodology which can be used for automatic classification of office documents. Besides a review of classic methods and techniques, the focus is also placed on the application of artificial intelligence.
Source: Records management quarterly. 29(1995) no.4, S.3-18

Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.00

0.0011248072 = product of:
  0.010123264 = sum of:
    0.0031348949 = weight(_text_:in in 1595) [ClassicSimilarity], result of:
      0.0031348949 = score(doc=1595,freq=2.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.10520181 = fieldWeight in 1595, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1595)
    0.006988369 = product of:
      0.020965107 = sum of:
        0.020965107 = weight(_text_:29 in 1595) [ClassicSimilarity], result of:
          0.020965107 = score(doc=1595,freq=2.0), product of:
            0.077061385 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.021906832 = queryNorm
            0.27205724 = fieldWeight in 1595, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1595)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Date: 11. 5.2003 18:29:44
Source: Advances in classification research, vol.10: proceedings of the 10th ASIS SIG/CR Classification Research Workshop. Ed.: Albrechtsen, H. u. J.E. Mai

Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.00
```
0.0011059797 = product of:
  0.009953817 = sum of:
    0.0050070276 = weight(_text_:in in 1107) [ClassicSimilarity], result of:
      0.0050070276 = score(doc=1107,freq=10.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.16802745 = fieldWeight in 1107, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.0049467892 = product of:
      0.014840367 = sum of:
        0.014840367 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.014840367 = score(doc=1107,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)
```
Abstract

Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.

Date

28.10.2013 19:22:57

Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.00

0.0010818016 = product of:
  0.009736214 = sum of:
    0.003800067 = weight(_text_:in in 2760) [ClassicSimilarity], result of:
      0.003800067 = score(doc=2760,freq=4.0), product of:
        0.029798867 = queryWeight, product of:
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.021906832 = queryNorm
        0.12752387 = fieldWeight in 2760, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.3602545 = idf(docFreq=30841, maxDocs=44218)
          0.046875 = fieldNorm(doc=2760)
    0.0059361467 = product of:
      0.01780844 = sum of:
        0.01780844 = weight(_text_:22 in 2760) [ClassicSimilarity], result of:
          0.01780844 = score(doc=2760,freq=2.0), product of:
            0.076713994 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.021906832 = queryNorm
            0.23214069 = fieldWeight in 2760, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2760)
      0.33333334 = coord(1/3)
  0.11111111 = coord(2/18)

Abstract: Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
Date: 22. 3.2009 19:11:54
