Search (119 results, page 1 of 6)

  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.33
    Abstract
    Document representations for text classification are typically based on the classical bag-of-words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for the actual classification. Experimental evaluations on two well-known text corpora support our approach through consistent improvement of the results. (A toy sketch of the idea follows this entry.)
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
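    The pipeline described in the abstract above (bag-of-words features extended with concept features, then boosted weak learners) can be illustrated in a few lines. This is a toy sketch, not the paper's implementation: the CONCEPTS lookup is a hypothetical stand-in for the background knowledge (e.g., an ontology or WordNet) from which the paper extracts concepts, and scikit-learn's AdaBoostClassifier over decision stumps stands in for its boosting setup.

```python
# Toy sketch of entry 1's idea: extend bag-of-words features with concept
# features from background knowledge, then boost weak learners (stumps).
# CONCEPTS is a hypothetical stand-in for an ontology/WordNet lookup.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

CONCEPTS = {"puppy": "concept_dog", "dog": "concept_dog", "cat": "concept_feline"}

def with_concepts(text):
    # Append one concept token per covered word, so documents that share a
    # concept overlap even when they share no surface words.
    tokens = text.lower().split()
    return " ".join(tokens + [CONCEPTS[t] for t in tokens if t in CONCEPTS])

docs = ["the puppy barked", "the dog slept", "a cat purred", "the cat hissed"]
labels = ["dog", "dog", "cat", "cat"]

vec = CountVectorizer()
X = vec.fit_transform(with_concepts(d) for d in docs)

clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=25)
clf.fit(X, labels)
print(clf.predict(vec.transform([with_concepts("a puppy slept")])))
```

    Because "puppy" and "dog" map to the same concept token, the unseen word "puppy" still carries evidence for the "dog" class, which is exactly the kind of generalization the abstract claims for concept features.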
  2. Yao, H.; Etzkorn, L.H.; Virani, S.: Automated classification and retrieval of reusable software components (2008) 0.09
    Abstract
    The authors describe their research, which improves software reuse by using an automated approach to semantically search for and retrieve reusable software components in large software component repositories and on the World Wide Web (WWW). Using automation and smart (semantic) techniques, their approach speeds up the search and retrieval of reusable software components while retaining good accuracy, and therefore improves the affordability of software reuse. Program understanding of software components and natural language understanding of user queries were employed. The software component descriptions were then compared by matching the semantic representations of the user queries to the semantic representations of the software components, in order to find the components that best match each query. A proof-of-concept system was developed to test the approach; its results were compared with those of human experts, and statistical analysis was performed on the collected experimental data. These experiments demonstrate that the automated semantic-based approach to classifying and retrieving reusable software components stands up well against the labor-intensive judgments of the experts, showing that it can significantly benefit software reuse classification and retrieval.
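    As a rough illustration of the matching step described above, the sketch below ranks component descriptions against a user query by cosine similarity. It is an assumption-laden simplification: the paper derives richer semantic representations via program understanding and natural language understanding, for which plain TF-IDF vectors serve here as a stand-in.

```python
# Sketch of the matching step only: rank component descriptions against a
# user query by cosine similarity. TF-IDF stands in for the paper's
# semantic representations built via program/NL understanding.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

components = [
    "stack implementation with push and pop operations",
    "binary search over a sorted integer array",
    "HTTP client for fetching web resources",
]
query = ["search a sorted list for a value"]

vec = TfidfVectorizer()
C = vec.fit_transform(components)
q = vec.transform(query)

# Highest-scoring components are returned as the best reuse candidates.
scores = cosine_similarity(q, C).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {components[i]}")
```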
  3. Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.06
    Abstract
    The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK-based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
    Date
    1. 8.1996 22:08:06
    Footnote
    Contribution to a special issue devoted to the Proceedings of the 7th International World Wide Web Conference, held 14-18 April 1998, Brisbane, Australia; see also: http://www7.scu.edu.au/programme/posters/1846/com1846.htm.
    Theme
    Klassifikationssysteme im Online-Retrieval
  4. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.05
    Abstract
    We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded in singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between the latent semantic indexing (LSI) term subspace and the LSI document subspace. LSISSM performs feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution-ranking mechanism in LSISSM also improves the initialization of standard K-means compared with the random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures. (A sketch of the plain LSI baseline follows this entry.)
    Date
    23. 3.2013 13:22:36
    Object
    Latent semantic indexing
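    For readers unfamiliar with the baseline this model refines, the following sketch shows plain LSI feeding K-means: a TF-IDF term-document matrix is decomposed with truncated SVD and documents are clustered in the reduced space. The signature distributions and contribution-ranking initialization that constitute LSISSM itself are not reproduced here.

```python
# Sketch: plain LSI (truncated SVD of a TF-IDF matrix) feeding K-means,
# the baseline pipeline that LSISSM refines.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["latent semantic indexing", "singular value decomposition",
        "soccer match report", "football league results"]

X = TfidfVectorizer().fit_transform(docs)
# Keep only the top-ranking latent dimensions (a rank-2 approximation here).
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)
print(km.labels_)
```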
  5. Ingwersen, P.; Wormell, I.: Ranganathan in the perspective of advanced information retrieval (1992) 0.05
    Abstract
    Examines Ranganathan's approach to knowledge organisation and its relevance to intellectual accessibility in libraries. Discusses the current and future developments of his methodology and theories in knowledge-based systems. Topics covered include: semi-automatic classification and the structure of thesauri; user-intermediary interactions in information retrieval (IR); semantic value-theory and uncertainty principles in IR; and case grammar.
  6. Ru, C.; Tang, J.; Li, S.; Xie, S.; Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction (2018) 0.05
    Abstract
    Distant supervision (DS) has the advantage of automatically generating large amounts of labelled training data and has been widely used for relation extraction. However, the automatically labelled data in distant supervision usually contain many wrong labels (Riedel, Yao, & McCallum, 2010). This paper presents a novel method to reduce the wrong labels. The proposed method uses the semantic Jaccard with word embedding to measure the semantic similarity between the relation phrase in the knowledge base and the dependency phrases between two entities in a sentence, and uses this similarity to filter out wrong labels. In the process of reducing wrong labels, the semantic Jaccard algorithm selects a core dependency phrase to represent the candidate relation in a sentence, which can capture features for relation classification and avoid the negative impact of irrelevant term sequences from which previous neural network models of relation extraction often suffer. In the process of relation classification, the core dependency phrases are also used as the input of a convolutional neural network (CNN). The experimental results show that, compared with methods using the original DS data, methods using the filtered DS data performed much better in relation extraction, indicating that the semantic-similarity-based method is effective in reducing wrong labels. The relation extraction performance of the CNN model using the core dependency phrases as input is the best of all, which indicates that the core dependency phrases are sufficient to capture the features for relation classification while avoiding the negative impact of irrelevant terms. (A sketch of one plausible reading of the semantic Jaccard follows this entry.)
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
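    The abstract does not give the exact formula, but one plausible reading of "semantic Jaccard with word embedding" is a soft Jaccard in which two words count as overlapping when their embedding cosine similarity clears a threshold. The sketch below implements that reading; EMB is a hypothetical embedding lookup and the threshold is an arbitrary choice, so treat this as an illustration rather than the paper's definition.

```python
# A plausible "semantic Jaccard": soft word overlap via embedding cosine
# similarity. EMB is a hypothetical word-embedding lookup; the paper's
# exact formulation may differ.
import numpy as np

EMB = {
    "born":    np.array([0.9, 0.1, 0.0]),
    "birth":   np.array([0.8, 0.2, 0.1]),
    "place":   np.array([0.1, 0.9, 0.0]),
    "located": np.array([0.0, 0.8, 0.3]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_jaccard(words_a, words_b, thresh=0.85):
    inter = 0.0
    for a in words_a:
        # Soft intersection: best embedding match in B for each word of A.
        sims = [cos(EMB[a], EMB[b]) for b in words_b if a in EMB and b in EMB]
        best = max(sims, default=0.0)
        if best >= thresh:
            inter += best
    union = len(words_a) + len(words_b) - inter
    return inter / union if union else 0.0

# Relation phrase from the knowledge base vs. dependency phrase from a sentence.
print(semantic_jaccard(["birth", "place"], ["born", "located"]))
```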
  7. Smiraglia, R.P.; Cai, X.: Tracking the evolution of clustering, machine learning, automatic indexing and automatic classification in knowledge organization (2017) 0.05
    Abstract
    A very important extension of the traditional domain of knowledge organization (KO) arises from attempts to incorporate techniques devised in the computer science domain for automatic concept extraction and for grouping, categorizing, clustering, and otherwise organizing knowledge by mechanical means. Four specific terms have emerged to identify the most prevalent techniques: machine learning, clustering, automatic indexing, and automatic classification. Our study presents three domain-analytical case analyses in search of answers. The first case relies on citations located using the ISKO-supported "Knowledge Organization Bibliography." The second case relies on works in both Web of Science and SCOPUS. Case three applies co-word analysis and citation analysis to the contents of the papers in the present special issue. We observe scholars involved in "clustering" and "automatic classification" who share common thematic emphases, but we have found no coherence, no common activity, and no social semantics. We have not found a research front or a common teleology within the KO domain. We have, however, found a lively group of authors who succeeded in submitting papers to this special issue, and their work quite interestingly aligns with the case studies we report. There is an emphasis on KO for information retrieval; there is much work on clustering (which involves conceptual points within texts) and automatic classification (which involves semantic groupings at the meta-document level).
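    The co-word analysis mentioned in case three reduces, at its core, to counting how often keyword pairs co-occur across papers. A minimal sketch (with made-up keyword lists standing in for the special issue's papers) might look like this:

```python
# Sketch of the co-word analysis step: count keyword pair co-occurrence
# across papers. The keyword lists are illustrative placeholders.
from itertools import combinations
from collections import Counter

papers = [
    ["clustering", "machine learning"],
    ["automatic classification", "machine learning"],
    ["clustering", "automatic indexing", "machine learning"],
]

cooc = Counter()
for kws in papers:
    for a, b in combinations(sorted(set(kws)), 2):
        cooc[(a, b)] += 1

# Strong pairs suggest shared thematic emphases within the domain.
for pair, n in cooc.most_common(3):
    print(n, pair)
```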
  8. HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.05
    Date
    1. 2.2016 18:25:22
    Source
    Semantic keyword-based search on structured data sources: First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers. Eds.: J. Cardoso et al
  9. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.03
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).

    Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats.

    Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract automatically the requisite collection metadata that must be distributed.
    We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a 'training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature.

    Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface; rather, it is intended merely to offer a view of the process and suggest the 'look and feel' of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
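    The core classification step described above, associating terms from labeled catalog records with LCC categories and then mapping new text to the closest categories, can be caricatured with a nearest-centroid classifier over TF-IDF vectors. This is a stand-in sketch under stated assumptions: Pharos uses Latent Semantic Indexing and a far richer terminology set, and the records and class codes below are toy placeholders.

```python
# Sketch: learn term/category associations from labeled "catalog records"
# (toy data), then assign new text to the nearest category centroid.
# Nearest-centroid over TF-IDF stands in for the LSI used by Pharos.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid

records = ["prostate cancer treatment study", "oncology clinical trial",
           "satellite remote sensing imagery", "aerial photography analysis"]
lcc = ["RC", "RC", "G", "G"]   # toy LCC class codes

vec = TfidfVectorizer()
X = vec.fit_transform(records)
clf = NearestCentroid().fit(X, lcc)

print(clf.predict(vec.transform(["remote sensing of coastal regions"])))
```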
  10. Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.03
    Date
    22. 8.2009 12:54:24
    Theme
    Klassifikationssysteme im Online-Retrieval
  11. Reiner, U.: Automatische DDC-Klassifizierung bibliografischer Titeldatensätze der Deutschen Nationalbibliografie (2009) 0.03
    Abstract
    The number of publications requiring classification has been growing faster than they can be intellectually subject-indexed, at the latest since the World Wide Web came into existence. Methods are therefore being sought to automate the classification of text objects, or at least to support intellectual classification. Methods for automatic document classification (information retrieval, IR) have existed since 1968, and methods for automatic text classification (ATC: Automated Text Categorization) since 1992. As ever more digital objects have become available on the World Wide Web, work on automatic text classification has increased markedly since about 1998. Since 1996 this has also included work on the automatic DDC or RVK classification of bibliographic title records and full-text documents. To our knowledge, the developments to date have been experimental systems rather than systems in continuous operation. The VZG project Colibri/DDC has likewise been concerned with automatic DDC classification, among other things, since 2006. The investigations and developments in this area serve to answer the research question: "Is it possible to achieve automatically a DDC title classification of all GVK-PLUS title records that is correct in content?"
    Date
    22. 1.2010 14:41:24
  12. Wätjen, H.-J.: Automatisches Sammeln, Klassifizieren und Indexieren von wissenschaftlich relevanten Informationsressourcen im deutschen World Wide Web : das DFG-Projekt GERHARD (1998) 0.03
    Theme
    Klassifikationssysteme im Online-Retrieval
  13. Vizine-Goetz, D.: NetLab / OCLC collaboration seeks to improve Web searching (1999) 0.03
    Theme
    Klassifikationssysteme im Online-Retrieval
  14. Möller, G.: Automatic classification of the World Wide Web using Universal Decimal Classification (1999) 0.03
    Theme
    Klassifikationssysteme im Online-Retrieval
  15. Shen, D.; Chen, Z.; Yang, Q.; Zeng, H.J.; Zhang, B.; Lu, Y.; Ma, W.Y.: Web page classification through summarization (2004) 0.03
    Source
    SIGIR'04: Proceedings of the 27th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Eds.: K. Järvelin et al.
  16. Chan, L.M.; Lin, X.; Zeng, M.: Structural and multilingual approaches to subject access on the Web (1999) 0.03
    Abstract
    Among the major challenges of meaningful searching on the WWW are the sheer volume of available material and the language barriers. Methods that structure web resources by content for more efficient retrieval are therefore needed just as urgently as programs that can cope with the diversity of languages. In the following talk we discuss several approaches currently being pursued to tackle these two problems.
  17. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.03
    Abstract
    Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents; that is, to hide information, hidden text is injected into the passages of a document. Rather than matching query terms against passages to determine their relevance, the passages are classified using text-mining techniques. Documents containing hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages; that is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms the other document-splitting approaches by a statistically significant (99% confidence) 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and training-data category distribution on passage-detection accuracy. (A sketch of the document-splitting step follows this entry.)
    Date
    22. 3.2009 19:14:43
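    The document-splitting half of the task above can be sketched as fixed-length overlapping word windows, each sent to a passage classifier. This is not KDP itself, whose passage boundaries are keyword-driven and dynamic; here, vec and clf are assumed to be a vectorizer and classifier already trained on labeled passages, and the window sizes are arbitrary.

```python
# Sketch: split a document into overlapping fixed-length passages and
# classify each one. KDP's keyword-based dynamic boundaries are not
# reproduced; `vec` and `clf` are assumed pre-trained on labeled passages.
def passages(text, size=50, overlap=25):
    words = text.split()
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + size])

def detect(text, vec, clf, target="sensitive"):
    # Flag the document as "infected" if any passage receives the target label.
    hits = [p for p in passages(text)
            if clf.predict(vec.transform([p]))[0] == target]
    return bool(hits), hits
```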
  18. Miyamoto, S.: Information clustering based on fuzzy multisets (2003) 0.03
    Abstract
    A fuzzy multiset model for information clustering is proposed, with application to information retrieval on the World Wide Web. Noting that a search engine retrieves multiple occurrences of the same subjects with possibly different degrees of relevance, we observe that fuzzy multisets provide an appropriate model of information retrieval on the WWW. Information clustering, meaning both term clustering and document clustering, is considered. Three methods are proposed: hard c-means, fuzzy c-means, and an agglomerative method using cluster centers. Two distances between fuzzy multisets and algorithms for calculating cluster centers are defined. Theoretical properties concerning the clustering algorithms are studied, and illustrative examples are given to show how the algorithms work.
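    Of the three methods named in the abstract, fuzzy c-means is the most self-contained to sketch. The version below is the standard algorithm on plain vectors; the paper's contribution, distances between fuzzy multisets and the corresponding cluster-center computations, is not reproduced.

```python
# Minimal standard fuzzy c-means in NumPy (plain vectors; the paper's
# fuzzy-multiset distances and agglomerative variant are not reproduced).
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                      # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        # Distances from every center to every point, with a zero guard.
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        # u_ij = d_ij^(-2/(m-1)) / sum_k d_kj^(-2/(m-1))
        p = 2.0 / (m - 1.0)
        U = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=0))
    return centers, U

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, U = fuzzy_cmeans(X)
print(U.round(2))   # soft cluster memberships for each point
```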
  19. Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.03
    Abstract
    This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and applies that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in two key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to use a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. In the second phase of the project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
    Date
    4. 8.2015 19:22:04
  20. Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.03
    Abstract
    Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
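    The abstract does not enumerate the five measures, but two classical members of the link-similarity family, co-citation and bibliographic coupling, illustrate how topic similarity can be read off the hyperlink structure alone. The matrix below is a toy example, not data from the paper.

```python
# Sketch: two classical link-based similarities from a binary link matrix
# A (A[i, j] = 1 iff page i links to page j). The paper evaluates five
# measures; these two are standard examples of the family.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])

cocitation = A.T @ A   # (i, j): number of pages linking to both i and j
coupling   = A @ A.T   # (i, j): number of pages both i and j link to

print(cocitation)
print(coupling)
```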

Languages

  • e 95
  • d 23
  • chi 1

Types

  • a 94
  • el 22
  • m 4
  • x 4
  • r 3
  • s 2
  • d 1