Search (136 results, page 2 of 7)

Hagedorn, K.; Chapman, S.; Newman, D.: Enhancing search and browse using automated clustering of subject metadata (2007) 0.01

0.009632302 = product of:
  0.044950746 = sum of:
    0.020922182 = weight(_text_:web in 1168) [ClassicSimilarity], result of:
      0.020922182 = score(doc=1168,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.21634221 = fieldWeight in 1168, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
    0.0060537956 = weight(_text_:information in 1168) [ClassicSimilarity], result of:
      0.0060537956 = score(doc=1168,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.116372846 = fieldWeight in 1168, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
    0.01797477 = weight(_text_:retrieval in 1168) [ClassicSimilarity], result of:
      0.01797477 = score(doc=1168,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.20052543 = fieldWeight in 1168, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.046875 = fieldNorm(doc=1168)
  0.21428572 = coord(3/14)

Abstract: The Web puzzle of online information resources often hinders end-users from effective and efficient access to these resources. Clustering resources into appropriate subject-based groupings may help alleviate these difficulties, but will it work with heterogeneous material? The University of Michigan and the University of California Irvine joined forces to test automatically enhancing metadata records using the Topic Modeling algorithm on the varied OAIster corpus. We created labels for the resulting clusters of metadata records, matched the clusters to an in-house classification system, and developed a prototype that would showcase methods for search and retrieval using the enhanced records. Results indicated that while the algorithm was somewhat time-intensive to run and using a local classification scheme had its drawbacks, precise clustering of records was achieved and the prototype interface proved that faceted classification could be powerful in helping end-users find resources.

Vizine-Goetz, D.: NetLab / OCLC collaboration seeks to improve Web searching (1999) 0.01

0.00926118 = product of:
  0.064828254 = sum of:
    0.034870304 = weight(_text_:web in 4180) [ClassicSimilarity], result of:
      0.034870304 = score(doc=4180,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.36057037 = fieldWeight in 4180, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.078125 = fieldNorm(doc=4180)
    0.029957948 = weight(_text_:retrieval in 4180) [ClassicSimilarity], result of:
      0.029957948 = score(doc=4180,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.33420905 = fieldWeight in 4180, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.078125 = fieldNorm(doc=4180)
  0.14285715 = coord(2/14)

Theme: Klassifikationssysteme im Online-Retrieval

Rijsbergen, C.J. van: Automatic classification in information retrieval (1978) 0.01

0.009153739 = product of:
  0.06407617 = sum of:
    0.016143454 = weight(_text_:information in 2412) [ClassicSimilarity], result of:
      0.016143454 = score(doc=2412,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.3103276 = fieldWeight in 2412, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.125 = fieldNorm(doc=2412)
    0.047932718 = weight(_text_:retrieval in 2412) [ClassicSimilarity], result of:
      0.047932718 = score(doc=2412,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.5347345 = fieldWeight in 2412, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.125 = fieldNorm(doc=2412)
  0.14285715 = coord(2/14)

Choi, B.; Peng, X.: Dynamic and hierarchical classification of Web pages (2004) 0.01
```
0.008181273 = product of:
  0.057268906 = sum of:
    0.046783425 = weight(_text_:web in 2555) [ClassicSimilarity], result of:
      0.046783425 = score(doc=2555,freq=10.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.48375595 = fieldWeight in 2555, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=2555)
    0.0104854815 = weight(_text_:information in 2555) [ClassicSimilarity], result of:
      0.0104854815 = score(doc=2555,freq=6.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.20156369 = fieldWeight in 2555, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=2555)
  0.14285715 = coord(2/14)
```
Abstract

Automatic classification of Web pages is an effective way to organise the vast amount of information and to assist in retrieving relevant information from the Internet. Although many automatic classification systems have been proposed, most of them ignore the conflict between the fixed number of categories and the growing number of Web pages being added into the systems. They also require searching through all existing categories to make any classification. This article proposes a dynamic and hierarchical classification system that is capable of adding new categories as required, organising the Web pages into a tree structure, and classifying Web pages by searching through only one path of the tree. The proposed single-path search technique reduces the search complexity from (n) to (log(n)). Test results show that the system improves the accuracy of classification by 6 percent in comparison to related systems. The dynamic-category expansion technique also achieves satisfying results for adding new categories into the system as required.

Source

Online information review. 28(2004) no.2, S.139-147
Smiraglia, R.P.; Cai, X.: Tracking the evolution of clustering, machine learning, automatic indexing and automatic classification in knowledge organization (2017) 0.01
```
0.008026919 = product of:
  0.037458956 = sum of:
    0.017435152 = weight(_text_:web in 3627) [ClassicSimilarity], result of:
      0.017435152 = score(doc=3627,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.18028519 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
    0.0050448296 = weight(_text_:information in 3627) [ClassicSimilarity], result of:
      0.0050448296 = score(doc=3627,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.09697737 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
    0.014978974 = weight(_text_:retrieval in 3627) [ClassicSimilarity], result of:
      0.014978974 = score(doc=3627,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
  0.21428572 = coord(3/14)
```
Abstract

A very important extension of the traditional domain of knowledge organization (KO) arises from attempts to incorporate techniques devised in the computer science domain for automatic concept extraction and for grouping, categorizing, clustering and otherwise organizing knowledge using mechanical means. Four specific terms have emerged to identify the most prevalent techniques: machine learning, clustering, automatic indexing, and automatic classification. Our study presents three domain analytical case analyses in search of answers. The first case relies on citations located using the ISKO-supported "Knowledge Organization Bibliography." The second case relies on works in both Web of Science and SCOPUS. Case three applies co-word analysis and citation analysis to the contents of the papers in the present special issue. We observe scholars involved in "clustering" and "automatic classification" who share common thematic emphases. But we have found no coherence, no common activity and no social semantics. We have not found a research front, or a common teleology within the KO domain. We also have found a lively group of authors who have succeeded in submitting papers to this special issue, and their work quite interestingly aligns with the case studies we report. There is an emphasis on KO for information retrieval; there is much work on clustering (which involves conceptual points within texts) and automatic classification (which involves semantic groupings at the meta-document level).

Wu, M.; Fuller, M.; Wilkinson, R.: Using clustering and classification approaches in interactive retrieval (2001) 0.01

0.008009522 = product of:
  0.05606665 = sum of:
    0.014125523 = weight(_text_:information in 2666) [ClassicSimilarity], result of:
      0.014125523 = score(doc=2666,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.27153665 = fieldWeight in 2666, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.109375 = fieldNorm(doc=2666)
    0.04194113 = weight(_text_:retrieval in 2666) [ClassicSimilarity], result of:
      0.04194113 = score(doc=2666,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.46789268 = fieldWeight in 2666, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.109375 = fieldNorm(doc=2666)
  0.14285715 = coord(2/14)

Source: Information processing and management. 37(2001) no.3, S.459-484

Lim, C.S.; Lee, K.J.; Kim, G.C.: Multiple sets of features for automatic genre classification of web documents (2005) 0.01
```
0.00783814 = product of:
  0.05486698 = sum of:
    0.046129078 = weight(_text_:web in 1048) [ClassicSimilarity], result of:
      0.046129078 = score(doc=1048,freq=14.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.47698978 = fieldWeight in 1048, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1048)
    0.008737902 = weight(_text_:information in 1048) [ClassicSimilarity], result of:
      0.008737902 = score(doc=1048,freq=6.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16796975 = fieldWeight in 1048, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1048)
  0.14285715 = coord(2/14)
```
Abstract

With the increase of information on the Web, it is difficult to find desired information quickly out of the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has been focused on a subject or a topic of a document. A genre or a style is another view of a document different from a subject or a topic. The genre is also a criterion to classify documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, which have been proposed in the previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of a certain word, etc. However, web documents are different from textual documents in that they contain URL and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URL and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features, and to discuss their characteristics. Finally, we conclude which is an appropriate set of features in automatic genre classification of web documents.

Source

Information processing and management. 41(2005) no.5, S.1263-1276
Ko, Y.: ¬A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.01
```
0.007581752 = product of:
  0.05307226 = sum of:
    0.01712272 = weight(_text_:information in 2339) [ClassicSimilarity], result of:
      0.01712272 = score(doc=2339,freq=16.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.3291521 = fieldWeight in 2339, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
    0.03594954 = weight(_text_:retrieval in 2339) [ClassicSimilarity], result of:
      0.03594954 = score(doc=2339,freq=8.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.40105087 = fieldWeight in 2339, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
  0.14285715 = coord(2/14)
```
Abstract

Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.

Source

Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2553-2565
Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.01
```
0.007503826 = product of:
  0.035017855 = sum of:
    0.013347364 = weight(_text_:information in 1107) [ClassicSimilarity], result of:
      0.013347364 = score(doc=1107,freq=14.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.256578 = fieldWeight in 1107, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.014978974 = weight(_text_:retrieval in 1107) [ClassicSimilarity], result of:
      0.014978974 = score(doc=1107,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.0066915164 = product of:
      0.020074548 = sum of:
        0.020074548 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.020074548 = score(doc=1107,freq=2.0), product of:
            0.103770934 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029633347 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.33333334 = coord(1/3)
  0.21428572 = coord(3/14)
```
Abstract

Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.

Date

28.10.2013 19:22:57

Source

Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2265-2277

Golub, K.: Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations (2006) 0.01

0.0072008185 = product of:
  0.050405726 = sum of:
    0.041844364 = weight(_text_:web in 5897) [ClassicSimilarity], result of:
      0.041844364 = score(doc=5897,freq=8.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.43268442 = fieldWeight in 5897, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=5897)
    0.00856136 = weight(_text_:information in 5897) [ClassicSimilarity], result of:
      0.00856136 = score(doc=5897,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16457605 = fieldWeight in 5897, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=5897)
  0.14285715 = coord(2/14)

Abstract: The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.

Yu, W.; Gong, Y.: Document clustering by concept factorization (2004) 0.01

0.006865305 = product of:
  0.04805713 = sum of:
    0.012107591 = weight(_text_:information in 4084) [ClassicSimilarity], result of:
      0.012107591 = score(doc=4084,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.23274569 = fieldWeight in 4084, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.09375 = fieldNorm(doc=4084)
    0.03594954 = weight(_text_:retrieval in 4084) [ClassicSimilarity], result of:
      0.03594954 = score(doc=4084,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.40105087 = fieldWeight in 4084, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.09375 = fieldNorm(doc=4084)
  0.14285715 = coord(2/14)

Source: SIGIR'04: Proceedings of the 27th Annual International ACM-SIGIR Conference an Research and Development in Information Retrieval. Ed.: K. Järvelin, u.a

Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.01
```
0.0068425946 = product of:
  0.04789816 = sum of:
    0.041844364 = weight(_text_:web in 1566) [ClassicSimilarity], result of:
      0.041844364 = score(doc=1566,freq=8.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.43268442 = fieldWeight in 1566, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
    0.0060537956 = weight(_text_:information in 1566) [ClassicSimilarity], result of:
      0.0060537956 = score(doc=1566,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.116372846 = fieldWeight in 1566, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
  0.14285715 = coord(2/14)
```
Abstract

This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier

Source

Journal of information science. 29(2003) no.2, S.117-126

Guerrero-Bote, V.P.; Moya Anegón, F. de; Herrero Solana, V.: Document organization using Kohonen's algorithm (2002) 0.01

0.00683917 = product of:
  0.04787419 = sum of:
    0.013980643 = weight(_text_:information in 2564) [ClassicSimilarity], result of:
      0.013980643 = score(doc=2564,freq=6.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.2687516 = fieldWeight in 2564, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0625 = fieldNorm(doc=2564)
    0.033893548 = weight(_text_:retrieval in 2564) [ClassicSimilarity], result of:
      0.033893548 = score(doc=2564,freq=4.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.37811437 = fieldWeight in 2564, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0625 = fieldNorm(doc=2564)
  0.14285715 = coord(2/14)

Abstract: The classification of documents from a bibliographic database is a task that is linked to processes of information retrieval based on partial matching. A method is described of vectorizing reference documents from LISA which permits their topological organization using Kohonen's algorithm. As an example a map is generated of 202 documents from LISA, and an analysis is made of the possibilities of this type of neural network with respect to the development of information retrieval systems based on graphical browsing.
Source: Information processing and management. 38(2002) no.1, S.79-89

Ingwersen, P.; Wormell, I.: Ranganathan in the perspective of advanced information retrieval (1992) 0.01

0.006472671 = product of:
  0.045308694 = sum of:
    0.011415146 = weight(_text_:information in 7695) [ClassicSimilarity], result of:
      0.011415146 = score(doc=7695,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.21943474 = fieldWeight in 7695, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0625 = fieldNorm(doc=7695)
    0.033893548 = weight(_text_:retrieval in 7695) [ClassicSimilarity], result of:
      0.033893548 = score(doc=7695,freq=4.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.37811437 = fieldWeight in 7695, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0625 = fieldNorm(doc=7695)
  0.14285715 = coord(2/14)

Abstract: Examnines Ranganathan's approach to knowledge organisation and its relevance to intellectual accessibility in libraries. Discusses the current and future developments of his methodology and theories in knowledge-based systems. Topics covered include: semi-automatic classification and structure of thesauri; user-intermediary interactions in information retrieval (IR); semantic value-theory and uncertainty principles in IR; and case grammar

McKiernan, G.: Automated categorisation of Web resources : a profile of selected projects, research, products, and services (1996) 0.01

0.006422852 = product of:
  0.044959962 = sum of:
    0.034870304 = weight(_text_:web in 2533) [ClassicSimilarity], result of:
      0.034870304 = score(doc=2533,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.36057037 = fieldWeight in 2533, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.078125 = fieldNorm(doc=2533)
    0.010089659 = weight(_text_:information in 2533) [ClassicSimilarity], result of:
      0.010089659 = score(doc=2533,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.19395474 = fieldWeight in 2533, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.078125 = fieldNorm(doc=2533)
  0.14285715 = coord(2/14)

Source: New review of information networking. 1996, no.2, S.15-40

Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.01

0.006041726 = product of:
  0.04229208 = sum of:
    0.036238287 = weight(_text_:web in 87) [ClassicSimilarity], result of:
      0.036238287 = score(doc=87,freq=6.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.37471575 = fieldWeight in 87, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=87)
    0.0060537956 = weight(_text_:information in 87) [ClassicSimilarity], result of:
      0.0060537956 = score(doc=87,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.116372846 = fieldWeight in 87, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=87)
  0.14285715 = coord(2/14)

Abstract: Most text classification techniques assume that manually labeled documents (corpora) can be easily obtained while learning text classifiers. However, labeled training documents are sometimes unavailable or inadequate even if they are available. The goal of this article is to present a self-learned approach to extract high-quality training documents from the Web when the required manually labeled documents are unavailable or of poor quality. To learn a text classifier automatically, we need only a set of user-defined categories and some highly related keywords. Extensive experiments are conducted to evaluate the performance of the proposed approach using the test set from the Reuters-21578 news data set. The experiments show that very promising results can be achieved only by using automatically extracted documents from the Web.
Source: Journal of the American Society for Information Science and Technology. 58(2007) no.1, S.88-96

Fang, H.: Classifying research articles in multidisciplinary sciences journals into subject categories (2015) 0.01
```
0.005702162 = product of:
  0.039915133 = sum of:
    0.034870304 = weight(_text_:web in 2194) [ClassicSimilarity], result of:
      0.034870304 = score(doc=2194,freq=8.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.36057037 = fieldWeight in 2194, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2194)
    0.0050448296 = weight(_text_:information in 2194) [ClassicSimilarity], result of:
      0.0050448296 = score(doc=2194,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.09697737 = fieldWeight in 2194, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2194)
  0.14285715 = coord(2/14)
```
Abstract

In the Thomson Reuters Web of Science database, the subject categories of a journal are applied to all articles in the journal. However, many articles in multidisciplinary Sciences journals may only be represented by a small number of subject categories. To provide more accurate information on the research areas of articles in such journals, we can classify articles in these journals into subject categories as defined by Web of Science based on their references. For an article in a multidisciplinary sciences journal, the method counts the subject categories in all of the article's references indexed by Web of Science, and uses the most numerous subject categories of the references to determine the most appropriate classification of the article. We used articles in an issue of Proceedings of the National Academy of Sciences (PNAS) to validate the correctness of the method by comparing the obtained results with the categories of the articles as defined by PNAS and their content. This study shows that the method provides more precise search results for the subject category of interest in bibliometric investigations through recognition of articles in multidisciplinary sciences journals whose work relates to a particular subject category.

Object

Web of science

Cui, H.; Heidorn, P.B.; Zhang, H.: ¬An approach to automatic classification of text for information retrieval (2002) 0.01

0.0056635872 = product of:
  0.03964511 = sum of:
    0.009988253 = weight(_text_:information in 174) [ClassicSimilarity], result of:
      0.009988253 = score(doc=174,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.1920054 = fieldWeight in 174, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0546875 = fieldNorm(doc=174)
    0.029656855 = weight(_text_:retrieval in 174) [ClassicSimilarity], result of:
      0.029656855 = score(doc=174,freq=4.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.33085006 = fieldWeight in 174, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0546875 = fieldNorm(doc=174)
  0.14285715 = coord(2/14)

Abstract: In this paper, we explore an approach to make better use of semi-structured documents in information retrieval in the domain of biology. Using machine learning techniques, we make those inherent structures explicit by XML markups. This marking up has great potentials in improving task performance in specimen identification and the usability of online flora and fauna.

Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.01
```
0.0056622867 = product of:
  0.039636005 = sum of:
    0.02465703 = weight(_text_:web in 3614) [ClassicSimilarity], result of:
      0.02465703 = score(doc=3614,freq=4.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.25496176 = fieldWeight in 3614, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3614)
    0.014978974 = weight(_text_:retrieval in 3614) [ClassicSimilarity], result of:
      0.014978974 = score(doc=3614,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 3614, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3614)
  0.14285715 = coord(2/14)
```
Abstract

Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification systems for browsing. The classification algorithm was evaluated by the users who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing showed to be correlated and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); when searching for class captions, returning the hierarchical tree expanded around the class in which caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.

Theme

Klassifikationssysteme im Online-Retrieval
Cortez, E.; Herrera, M.R.; Silva, A.S. da; Moura, E.S. de; Neubert, M.: Lightweight methods for large-scale product categorization (2011) 0.01
```
0.005449971 = product of:
  0.038149796 = sum of:
    0.029588435 = weight(_text_:web in 4758) [ClassicSimilarity], result of:
      0.029588435 = score(doc=4758,freq=4.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.3059541 = fieldWeight in 4758, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=4758)
    0.00856136 = weight(_text_:information in 4758) [ClassicSimilarity], result of:
      0.00856136 = score(doc=4758,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16457605 = fieldWeight in 4758, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=4758)
  0.14285715 = coord(2/14)
```
Abstract

In this article, we present a study about classification methods for large-scale categorization of product offers on e-shopping web sites. We present a study about the performance of previously proposed approaches and deployed a probabilistic approach to model the classification problem. We also studied an alternative way of modeling information about the description of product offers and investigated the usage of price and store of product offers as features adopted in the classification process. Our experiments used two collections of over a million product offers previously categorized by human editors and taxonomies of hundreds of categories from a real e-shopping web site. In these experiments, our method achieved an improvement of up to 9% in the quality of the categorization in comparison with the best baseline we have found.

Source

Journal of the American Society for Information Science and Technology. 62(2011) no.9, S.1839-1848

Search (136 results, page 2 of 7)

Authors

Years

Themes