Search (193 results, page 1 of 10)

  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.23
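     The score shown after each entry is the query-document relevance value computed by the engine's Lucene ClassicSimilarity (tf-idf) ranking, as reported in its per-term "explain" output. A minimal Python sketch of the per-term computation (the function name is illustrative, not a Lucene API; per-term contributions are additionally summed and scaled by coord factors):

       import math

       def classic_similarity_term_score(freq, doc_freq, max_docs, field_norm, query_norm):
           """Per-term tf-idf contribution in Lucene's ClassicSimilarity:
           score = queryWeight * fieldWeight, with tf = sqrt(freq) and
           idf = ln(maxDocs / (docFreq + 1)) + 1."""
           tf = math.sqrt(freq)
           idf = math.log(max_docs / (doc_freq + 1)) + 1.0
           query_weight = idf * query_norm        # idf * queryNorm
           field_weight = tf * idf * field_norm   # tf * idf * fieldNorm
           return query_weight * field_weight

       # Reproduces the 0.1692322 contribution the engine reports for a term
       # occurring twice (freq=2) in this record, with docFreq=24 and maxDocs=44218:
       print(classic_similarity_term_score(2.0, 24, 44218, 0.046875, 0.035517205))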
    Abstract
     Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well-known text corpora support our approach through consistent improvement of the results.
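     A minimal sketch of the approach described above (not the authors' code): each document's bag-of-words is extended with concept identifiers drawn from background knowledge, and a boosted ensemble of weak learners is trained on the combined features. The concept table here is a hypothetical toy mapping, and scikit-learn's AdaBoost over decision stumps stands in for the boosting scheme:

       from sklearn.ensemble import AdaBoostClassifier
       from sklearn.feature_extraction.text import CountVectorizer

       # Hypothetical background knowledge: terms mapped to higher-level concepts.
       CONCEPTS = {"beef": "concept:meat", "pork": "concept:meat",
                   "tofu": "concept:vegetarian", "lentil": "concept:vegetarian"}

       def enrich(doc):
           # Append a concept token for every term that has one.
           tokens = doc.split()
           return " ".join(tokens + [CONCEPTS[t] for t in tokens if t in CONCEPTS])

       docs = ["beef stew recipe", "pork roast recipe",
               "tofu curry recipe", "lentil soup recipe"]
       labels = [0, 0, 1, 1]  # 0 = meat dish, 1 = vegetarian dish

       vec = CountVectorizer(token_pattern=r"\S+")
       X = vec.fit_transform([enrich(d) for d in docs])
       clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)
       print(clf.predict(vec.transform([enrich("lentil stew recipe")])))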
    Content
     Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
    Source
    Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK
  2. Losee, R.M.; Haas, S.W.: Sublanguage terms : dictionaries, usage, and automatic classification (1995) 0.03
    Abstract
     The use of terms from natural and social science titles and abstracts is studied from the perspective of sublanguages and their specialized dictionaries. Explores different notions of sublanguage distinctiveness. Objective methods for separating hard and soft sciences are suggested based on measures of sublanguage use, dictionary characteristics, and sublanguage distinctiveness. Abstracts were automatically classified with a high degree of accuracy by using a formula that considers the degree of uniqueness of terms in each sublanguage. This may prove useful for text filtering in information retrieval systems
    Source
    Journal of the American Society for Information Science. 46(1995) no.7, S.519-529
  3. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.03
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC) [10], within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
  4. Huang, Y.-L.: ¬A theoretic and empirical research of cluster indexing for Mandarine Chinese full text document (1998) 0.03
    Abstract
     Since most popular commercialized systems for full text retrieval are designed with full text scanning and Boolean logic query mode, these systems use an oversimplified relationship between the indexing form and the content of a document. Reports the use of Singular Value Decomposition (SVD) to develop a Cluster Indexing Model (CIM) based on a Vector Space Model (VSM) in order to explore the index theory of cluster indexing for Chinese full text documents. From a series of experiments, it was found that the indexing performance of CIM is better than that of the traditional VSM, with almost equivalent effectiveness to the authority control of index terms
    Source
    Bulletin of library and information science. 1998, no.24, S.44-68
  5. Cui, H.; Heidorn, P.B.; Zhang, H.: ¬An approach to automatic classification of text for information retrieval (2002) 0.03
    Abstract
     In this paper, we explore an approach to make better use of semi-structured documents in information retrieval in the domain of biology. Using machine learning techniques, we make those inherent structures explicit by XML markup. This marking up has great potential for improving task performance in specimen identification and the usability of online flora and fauna.
    Source
    Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries : JCDL 2002 ; July 14 - 18, 2002, Portland, Oregon, USA. Ed. by Gary Marchionini
  6. Godby, C.J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.03
    Abstract
     This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
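     The log-likelihood statistic is not spelled out in the record; a plausible reading (an assumption, not confirmed by the source) is Dunning's log-likelihood ratio G2 over a 2x2 contingency table of heading/class co-occurrence counts, with low-scoring headings filtered out:

       import math

       def g2(k11, k12, k21, k22):
           """Dunning's log-likelihood ratio for a 2x2 contingency table:
           k11 = heading and class together, k12 = heading without class,
           k21 = class without heading,      k22 = neither."""
           total = k11 + k12 + k21 + k22
           def term(obs, row, col):
               expected = row * col / total
               return obs * math.log(obs / expected) if obs > 0 else 0.0
           r1, r2 = k11 + k12, k21 + k22
           c1, c2 = k11 + k21, k12 + k22
           return 2.0 * (term(k11, r1, c1) + term(k12, r1, c2)
                         + term(k21, r2, c1) + term(k22, r2, c2))

       # Large G2 = strong heading/class association; small values get filtered.
       print(g2(k11=30, k12=10, k21=20, k22=940))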
    Footnote
    Paper, IFLA Preconference "Subject Retrieval in a Networked Environment", Dublin, OH, August 2001.
  7. Adamson, G.W.; Boreham, J.: ¬The use of an association measure based on character structure to identify semantically related pairs of words and document titles (1974) 0.02
    Abstract
     An automatic classification technique has been developed, based on the character structure of words. Dice's similarity coefficient is computed from the number of matching digrams in pairs of character strings, and used to cluster sets of character strings. A sample of words from a chemical database was chosen to contain certain stems derived from the names of chemical elements. They were successfully clustered into groups of semantically related words. Each cluster is characterised by the root word from which all its members are derived. A second example of titles from Mathematical Reviews was clustered into well-defined classes, which compare favourably with the subject groupings of Mathematical Reviews
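     A minimal sketch of the measure (a set-based digram variant; the paper counts matching digrams in the character strings directly):

       def digrams(word):
           """Adjacent character pairs, e.g. 'zinc' -> {'zi', 'in', 'nc'}."""
           return {word[i:i + 2] for i in range(len(word) - 1)}

       def dice(a, b):
           """Dice's coefficient: 2*|shared digrams| / (|digrams(a)| + |digrams(b)|)."""
           da, db = digrams(a), digrams(b)
           return 2 * len(da & db) / (len(da) + len(db))

       print(dice("chlorine", "chloride"))   # high: shared 'chlor-' stem
       print(dice("chlorine", "magnesium"))  # low: unrelated words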
    Source
    Information storage and retrieval. 10(1974), S.253-260
  8. AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.02
    Abstract
    Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.
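     A minimal sketch of the representation step under stated assumptions (a toy matrix stands in for the study's 1,026-document, 50-dimension setting):

       import numpy as np

       # Rows = documents, columns = weighted term frequencies (toy data).
       X = np.array([[2.0, 1.0, 0.0, 0.0],
                     [1.0, 2.0, 0.0, 1.0],
                     [0.0, 0.0, 3.0, 1.0],
                     [0.0, 1.0, 2.0, 2.0]])

       k = 2  # the study uses 50 dimensions; 2 suffices for a toy matrix
       U, s, Vt = np.linalg.svd(X, full_matrices=False)
       doc_vecs = U[:, :k] * s[:k]  # each row: a document in the k-dim latent space

       def cosine(u, v):
           return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

       def euclidean(u, v):
           return float(np.linalg.norm(u - v))

       # Both distance notions used in the study's class-relatedness analysis:
       print(cosine(doc_vecs[0], doc_vecs[1]), euclidean(doc_vecs[0], doc_vecs[1]))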
  9. Godby, C.J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.02
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
    Source
    Subject retrieval in a networked environment: Proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. Ed.: I.C. McIlwaine
  10. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.02
    Abstract
     Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, hidden text is injected into passages of a document. Rather than matching query terms against passages to determine their relevance, the passages are classified using text-mining techniques. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks, a statistically significant difference (99% confidence). Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.
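     A hedged sketch of the basic pipeline: split a document into passages (a fixed-length window shown here as one of the simpler document-splitting approaches; KDP's keyword-based dynamic passages are not reproduced), classify each passage, and flag the document as infected if any passage falls into a disallowed category. The data and Naive Bayes classifier are toy stand-ins:

       from sklearn.feature_extraction.text import TfidfVectorizer
       from sklearn.naive_bayes import MultinomialNB

       def passages(text, size=4):
           """Split a document into fixed-length word windows."""
           words = text.split()
           return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

       # Toy training passages; one category is disallowed.
       train = ["quarterly sales figures rose", "marketing plan for spring",
                "launch codes stored offsite", "secret launch codes rotated"]
       labels = ["business", "business", "restricted", "restricted"]

       vec = TfidfVectorizer()
       clf = MultinomialNB().fit(vec.fit_transform(train), labels)

       doc = "spring marketing plan draft secret launch codes attached sales forecast"
       hits = [(p, clf.predict(vec.transform([p]))[0]) for p in passages(doc)]
       print("infected:", any(cat == "restricted" for _, cat in hits))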
    Date
    22. 3.2009 19:14:43
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.4, S.814-825
  11. Golub, K.; Soergel, D.; Buchanan, G.; Tudhope, D.; Lykke, M.; Hiom, D.: ¬A framework for evaluating automatic indexing or classification in the context of retrieval (2016) 0.02
    Abstract
    Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. Although some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The article reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.1, S.3-16
  12. Schiminovich, S.: Automatic classification and retrieval of documents by means of a bibliographic pattern discovery algorithm (1971) 0.02
    Source
    Information storage and retrieval. 6(1971), S.417-435
  13. Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.02
    Abstract
     Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer (genre classifiers should be reusable across multiple topics), which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
    Footnote
    Beitrag in einem Themenschwerpunkt "Computational analysis of style"
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.11, S.1506-1518
  14. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.02
    Abstract
     Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract the requisite collection metadata automatically that must be distributed.
     We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a 'training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image features. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
  15. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.02
    Abstract
     This paper presents a method that exploits the hierarchical structure of an indexing vocabulary to guide the development and training of machine learning methods for automatic text categorization. We present the design of a hierarchical classifier based on the divide-and-conquer principle. The method is evaluated using backpropagation neural networks as the machine learning algorithm, which learn to assign MeSH categories to a subset of MEDLINE records. Comparisons with the traditional Rocchio algorithm adapted for text categorization, as well as with flat neural network classifiers, are provided. The results indicate that the use of hierarchical structures improves performance significantly.
    Date
    11. 5.2003 18:29:44
    Source
    Advances in classification research, vol.10: proceedings of the 10th ASIS SIG/CR Classification Research Workshop. Ed.: Albrechtsen, H. u. J.E. Mai
  16. Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.02
    Abstract
     Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification system for browsing. The classification algorithm was evaluated by the users, who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing was shown to be correlated with and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from the start; allowing searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); and, when searching for class captions, returning the hierarchical tree expanded around the class in whose caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
    Source
    Journal of documentation. 65(2009) no.6, S.901-925
    Theme
    Klassifikationssysteme im Online-Retrieval
  17. Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.02
    Abstract
     The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK-based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
    Date
    1. 8.1996 22:08:06
    Footnote
     Contribution to a special issue devoted to the Proceedings of the 7th International World Wide Web Conference, held 14-18 April 1998, Brisbane, Australia; see also: http://www7.scu.edu.au/programme/posters/1846/com1846.htm.
    Theme
    Klassifikationssysteme im Online-Retrieval
  18. Ruocco, A.S.; Frieder, O.: Clustering and classification of large document bases in a parallel environment (1997) 0.02
    Abstract
     Proposes the use of parallel computing systems to overcome the computationally intensive clustering process. Examines two operations: clustering a document set and classifying the document set. Uses a subset of the TIPSTER corpus, specifically articles from the Wall Street Journal. Document set classification was performed without the large storage requirements for ancillary data matrices. The time performance of the parallel systems was an improvement over that of sequential systems, and they produced the same clustering and classification scheme. Results show near-linear speed-up in higher-threshold clustering applications
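     A minimal illustration of the idea, under stated assumptions (this is not the authors' parallel algorithm): document classification against fixed cluster centroids, distributed over worker processes:

       import math
       from multiprocessing import Pool

       CENTROIDS = {"finance": [1.0, 0.0], "sports": [0.0, 1.0]}  # toy centroids

       def nearest_centroid(doc_vec):
           """Assign one document vector to the closest centroid (Euclidean)."""
           return min(CENTROIDS, key=lambda c: math.dist(doc_vec, CENTROIDS[c]))

       if __name__ == "__main__":
           docs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
           with Pool(processes=2) as pool:  # classify documents in parallel
               print(pool.map(nearest_centroid, docs))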
    Date
    29. 7.1998 17:45:02
    Source
    Journal of the American Society for Information Science. 48(1997) no.10, S.932-943
  19. Yoon, Y.; Lee, C.; Lee, G.G.: ¬An effective procedure for constructing a hierarchical text classification system (2006) 0.02
    Abstract
     In text categorization tasks, classification over a class hierarchy often yields better results than classification without one. Currently, because a large number of documents are divided into several subgroups in a hierarchy, a hierarchical classification method can be applied appropriately. However, there is no systematic method for building a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
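     A minimal sketch of the structure being built (illustrative only; the article's evaluation scheme for internal-node classifiers is not reproduced): each internal node holds a classifier that routes a document to one of its children, and categorization is a descent from the root to a leaf:

       from dataclasses import dataclass, field
       from typing import Callable, Dict, Optional

       @dataclass
       class Node:
           name: str
           # Internal nodes carry a classifier mapping a document to a child name.
           classify: Optional[Callable[[str], str]] = None
           children: Dict[str, "Node"] = field(default_factory=dict)

       def categorize(node: Node, doc: str) -> str:
           """Descend from the root, letting each internal node pick a child."""
           while node.children:
               node = node.children[node.classify(doc)]
           return node.name

       # Toy two-level hierarchy; keyword rules stand in for trained classifiers.
       root = Node("root",
                   classify=lambda d: "science" if "atom" in d or "cell" in d else "arts",
                   children={
                       "science": Node("science",
                                       classify=lambda d: "physics" if "atom" in d else "biology",
                                       children={"physics": Node("physics"),
                                                 "biology": Node("biology")}),
                       "arts": Node("arts")})

       print(categorize(root, "the atom was split"))  # physics
       print(categorize(root, "a cell divides"))      # biology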
    Date
    22. 7.2006 16:24:52
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.3, S.431-442
  20. Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.02
    Abstract
     Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
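     The five link-based measures are not named in the abstract; one plausible instance (an assumption, not the authors' definition) is a bibliographic-coupling style overlap of the sets of pages two documents link to:

       def coupling_similarity(links_a, links_b):
           """Jaccard overlap of outgoing link sets (bibliographic-coupling flavour)."""
           a, b = set(links_a), set(links_b)
           return len(a & b) / len(a | b) if a | b else 0.0

       # Toy link structure: page -> pages it links to.
       out_links = {"p1": ["w3.org", "ietf.org", "example.com"],
                    "p2": ["w3.org", "ietf.org", "acm.org"],
                    "p3": ["imdb.com", "example.com"]}

       print(coupling_similarity(out_links["p1"], out_links["p2"]))  # 0.5
       print(coupling_similarity(out_links["p1"], out_links["p3"]))  # 0.25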
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.2, S.208-221

Languages

  • e 166
  • d 25
  • a 1
  • chi 1

Types

  • a 160
  • el 29
  • x 6
  • m 4
  • r 2
  • s 2
  • d 1