Search (39 results, page 1 of 2)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.15

```
0.14797027 = product of:
  0.24661711 = sum of:
    0.057946928 = product of:
      0.17384078 = sum of:
        0.17384078 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.17384078 = score(doc=562,freq=2.0), product of:
            0.3093153 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.036484417 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.17384078 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.17384078 = score(doc=562,freq=2.0), product of:
        0.3093153 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.036484417 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.014829405 = product of:
      0.02965881 = sum of:
        0.02965881 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.02965881 = score(doc=562,freq=2.0), product of:
            0.12776221 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036484417 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.6 = coord(3/5)
```

Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32
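
The score breakdown for this first hit can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are hypothetical, and queryNorm, fieldNorm and the coord factors are simply read off the listing rather than derived.

```python
import math

# Values copied from the explain output in the listing.
MAX_DOCS = 44218
QUERY_NORM = 0.036484417


def idf(doc_freq, max_docs=MAX_DOCS):
    """ClassicSimilarity inverse document frequency (assumed formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, query_norm=QUERY_NORM):
    """Score of one term in one document: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                         # tf(freq) = sqrt(freq)
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm         # idf * queryNorm
    field_weight = tf * term_idf * field_norm    # tf * idf * fieldNorm
    return query_weight * field_weight


# Document 562 matches three clauses; the coord factors come from the listing.
s_3a = term_score(2.0, 24, 0.046875) * (1 / 3)    # nested clause, coord(1/3)
s_2f = term_score(2.0, 24, 0.046875)              # top-level clause, no inner coord
s_22 = term_score(2.0, 3622, 0.046875) * (1 / 2)  # nested clause, coord(1/2)

total = (s_3a + s_2f + s_22) * (3 / 5)            # outer coord(3/5)
print(total)  # close to the listed 0.14797027, up to floating-point rounding
```

Lucene computes these values in single precision, so the last digits may differ slightly from the listing. The same `term_score` call with the docFreq, freq and fieldNorm of any other hit, multiplied by its coord factors, should give that hit's score.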

Golub, K.: Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations (2006) 0.01
```
0.0071366318 = product of:
  0.03568316 = sum of:
    0.03568316 = product of:
      0.07136632 = sum of:
        0.07136632 = weight(_text_:problems in 5897) [ClassicSimilarity], result of:
          0.07136632 = score(doc=5897,freq=6.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.47391602 = fieldWeight in 5897, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
Ozmutlu, S.; Cosar, G.C.: Analyzing the results of automatic new topic identification (2008) 0.01
```
0.0071366318 = product of:
  0.03568316 = sum of:
    0.03568316 = product of:
      0.07136632 = sum of:
        0.07136632 = weight(_text_:problems in 2604) [ClassicSimilarity], result of:
          0.07136632 = score(doc=2604,freq=6.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.47391602 = fieldWeight in 2604, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.046875 = fieldNorm(doc=2604)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Purpose - Identification of topic changes within a user search session is a key issue in content analysis of search engine user queries. Recently, various studies have focused on new topic identification/session identification of search engine transaction logs, and several problems regarding the estimation of topic shifts and continuations were observed in these studies. This study aims to analyze the reasons for the problems that were encountered as a result of applying automatic new topic identification. Design/methodology/approach - Measures, such as cleaning the data of common words and analyzing the errors of automatic new topic identification, are applied to eliminate the problems in estimating topic shifts and continuations. Findings - The findings show that the resulting errors of automatic new topic identification have a pattern, and further research is required to improve the performance of automatic new topic identification. Originality/value - Improving the performance of automatic new topic identification would be valuable to search engine designers, so that they can develop new clustering and query recommendation algorithms, as well as custom-tailored graphical user interfaces for search engine users.
Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.01
```
0.007095774 = product of:
  0.03547887 = sum of:
    0.03547887 = product of:
      0.07095774 = sum of:
        0.07095774 = weight(_text_:etc in 316) [ClassicSimilarity], result of:
          0.07095774 = score(doc=316,freq=2.0), product of:
            0.19761753 = queryWeight, product of:
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.036484417 = queryNorm
            0.35906604 = fieldWeight in 316, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.046875 = fieldNorm(doc=316)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC) [10], within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
Denoyer, L.; Gallinari, P.: Bayesian network model for semi-structured document classification (2004) 0.01
```
0.007095774 = product of:
  0.03547887 = sum of:
    0.03547887 = product of:
      0.07095774 = sum of:
        0.07095774 = weight(_text_:etc in 995) [ClassicSimilarity], result of:
          0.07095774 = score(doc=995,freq=2.0), product of:
            0.19761753 = queryWeight, product of:
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.036484417 = queryNorm
            0.35906604 = fieldWeight in 995, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.046875 = fieldNorm(doc=995)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Recently, a new community has started to emerge around the development of new information research methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that can handle both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here text and images). The model was tested on three databases: the classical webKB corpus composed of HTML pages, the new INEX corpus which has become a reference in the field of ad-hoc retrieval for XML documents, and a multimedia corpus of Web pages.
Cosh, K.J.; Burns, R.; Daniel, T.: Content clouds : classifying content in Web 2.0 (2008) 0.01
```
0.007095774 = product of:
  0.03547887 = sum of:
    0.03547887 = product of:
      0.07095774 = sum of:
        0.07095774 = weight(_text_:etc in 2013) [ClassicSimilarity], result of:
          0.07095774 = score(doc=2013,freq=2.0), product of:
            0.19761753 = queryWeight, product of:
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.036484417 = queryNorm
            0.35906604 = fieldWeight in 2013, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.046875 = fieldNorm(doc=2013)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Purpose - With increasing amounts of user-generated content being produced electronically in the form of wikis, blogs, forums, etc., the purpose of this paper is to investigate a new approach to classifying ad hoc content. Design/methodology/approach - The approach applies natural language processing (NLP) tools to automatically extract the content of some text, visualizing the results in a content cloud. Findings - Content clouds share the visual simplicity of a tag cloud, but display the details of an article at a different level of abstraction, providing a complementary classification. Research limitations/implications - Provides the general approach to creating a content cloud. In the future, the process can be refined and enhanced by further evaluation of results. Further work is also required to better identify closely related articles. Practical implications - Being able to automatically classify the content generated by web users will enable others to find more appropriate content. Originality/value - The approach is original. Other researchers have produced a cloud simply by using skiplists to filter unwanted words; this paper's approach improves on this by applying appropriate NLP techniques.

Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.01

```
0.005931762 = product of:
  0.02965881 = sum of:
    0.02965881 = product of:
      0.05931762 = sum of:
        0.05931762 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
          0.05931762 = score(doc=1046,freq=2.0), product of:
            0.12776221 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036484417 = queryNorm
            0.46428138 = fieldWeight in 1046, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=1046)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```

Date: 5. 5.2003 14:17:22

Lim, C.S.; Lee, K.J.; Kim, G.C.: Multiple sets of features for automatic genre classification of web documents (2005) 0.01
```
0.0059131454 = product of:
  0.029565725 = sum of:
    0.029565725 = product of:
      0.05913145 = sum of:
        0.05913145 = weight(_text_:etc in 1048) [ClassicSimilarity], result of:
          0.05913145 = score(doc=1048,freq=2.0), product of:
            0.19761753 = queryWeight, product of:
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.036484417 = queryNorm
            0.2992217 = fieldWeight in 1048, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1048)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

With the increase of information on the Web, it is difficult to find desired information quickly out of the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has been focused on a subject or a topic of a document. A genre or a style is another view of a document different from a subject or a topic. The genre is also a criterion to classify documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, which has been proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of occurrences of a certain word, etc. However, web documents are different from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features, and to discuss their characteristics. Finally, we conclude which is an appropriate set of features in automatic genre classification of web documents.
Xu, Y.; Bernard, A.: Knowledge organization through statistical computation : a new approach (2009) 0.01
```
0.0058270353 = product of:
  0.029135175 = sum of:
    0.029135175 = product of:
      0.05827035 = sum of:
        0.05827035 = weight(_text_:problems in 3252) [ClassicSimilarity], result of:
          0.05827035 = score(doc=3252,freq=4.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.3869508 = fieldWeight in 3252, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.046875 = fieldNorm(doc=3252)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Knowledge organization (KO) is an interdisciplinary issue which includes some problems in knowledge classification such as how to classify newly emerged knowledge. Given the great complexity and ambiguity of knowledge, it is sometimes inefficient to classify knowledge by logical reasoning. This paper proposes a statistical approach to knowledge organization in order to resolve the problems in classifying complex and mass knowledge. By integrating the classification process into a mathematical model, a knowledge classifier, based on the maximum entropy theory, is constructed and the experimental results show that the classification results acquired from the classifier are reliable. The approach proposed in this paper is quite formal and is not dependent on specific contexts, so it could easily be adapted to the use of knowledge classification in other domains within KO.
Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.01
```
0.0058270353 = product of:
  0.029135175 = sum of:
    0.029135175 = product of:
      0.05827035 = sum of:
        0.05827035 = weight(_text_:problems in 1071) [ClassicSimilarity], result of:
          0.05827035 = score(doc=1071,freq=4.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.3869508 = fieldWeight in 1071, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.046875 = fieldNorm(doc=1071)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

This paper aims to provide an overview of automatic classification research, which focuses on issues related to the automatic classification of documents in a library environment. The review covers literature published in mainstream library and information science studies. The review was done on literature published in both academic and professional LIS journals and other documents. This review reveals that basically three types of research are being done on automatic classification: 1) hierarchical classification using different library classification schemes, 2) text categorization and document categorization using different types of classifiers with or without using training documents, and 3) automatic bibliographic classification. Predominantly this research is directed towards solving problems of organization of digital documents in an online environment. However, very little research is devoted towards solving the problems of arrangement of physical documents.

Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.00

```
0.004943135 = product of:
  0.024715675 = sum of:
    0.024715675 = product of:
      0.04943135 = sum of:
        0.04943135 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
          0.04943135 = score(doc=611,freq=2.0), product of:
            0.12776221 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036484417 = queryNorm
            0.38690117 = fieldWeight in 611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=611)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```

Date: 22. 8.2009 12:54:24

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.00

```
0.004943135 = product of:
  0.024715675 = sum of:
    0.024715675 = product of:
      0.04943135 = sum of:
        0.04943135 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.04943135 = score(doc=2748,freq=2.0), product of:
            0.12776221 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036484417 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```

Date: 1. 2.2016 18:25:22

Golub, K.: Automated subject classification of textual web documents (2006) 0.00
```
0.004855863 = product of:
  0.024279313 = sum of:
    0.024279313 = product of:
      0.048558626 = sum of:
        0.048558626 = weight(_text_:problems in 5600) [ClassicSimilarity], result of:
          0.048558626 = score(doc=5600,freq=4.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.322459 = fieldWeight in 5600, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5600)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Purpose - To provide an integrated perspective to similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such. Design/methodology/approach - A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings - Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics is common to all the approaches; major differences are in applied algorithms, employment or not of the vector space model and of controlled vocabularies. Problems of automated classification are recognized. Research limitations/implications - The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications - As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities. Originality/value - To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.
Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.00
```
0.004855863 = product of:
  0.024279313 = sum of:
    0.024279313 = product of:
      0.048558626 = sum of:
        0.048558626 = weight(_text_:problems in 2119) [ClassicSimilarity], result of:
          0.048558626 = score(doc=2119,freq=4.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.322459 = fieldWeight in 2119, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2119)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Text categorization is an important research area and has been receiving much attention due to the growth of online information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much previous work focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind large-margin hyperplanes cannot be easily extended to multi-class text classification. In addition, the training time and scaling are also important concerns. On the other hand, other techniques naturally extensible to handle multi-class classification are generally not as accurate as SVMs. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity from the data. While most of the previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. By using generalized singular value decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.
Dang, E.K.F.; Luk, R.W.P.; Ho, K.S.; Chan, S.C.F.; Lee, D.L.: A new measure of clustering effectiveness : algorithms and experimental studies (2008) 0.00
```
0.0048070587 = product of:
  0.024035294 = sum of:
    0.024035294 = product of:
      0.048070587 = sum of:
        0.048070587 = weight(_text_:problems in 1367) [ClassicSimilarity], result of:
          0.048070587 = score(doc=1367,freq=2.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.31921813 = fieldWeight in 1367, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1367)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

We propose a new optimal clustering effectiveness measure, called CS1, based on a combination of clusters rather than selecting a single optimal cluster as in the traditional MK1 measure. For hierarchical clustering, we present an algorithm to compute CS1, defined by seeking the optimal combinations of disjoint clusters obtained by cutting the hierarchical structure at a certain similarity level. By reformulating the optimization to a 0-1 linear fractional programming problem, we demonstrate that an exact solution can be obtained by a linear time algorithm. We further discuss how our approach can be generalized to more general problems involving overlapping clusters, and we show how optimal estimates can be obtained by greedy algorithms.
Koch, T.; Ardö, A.; Brümmer, A.: The building and maintenance of robot based internet search services : A review of current indexing and data collection methods. Prepared to meet the requirements of Work Package 3 of EU Telematics for Research, project DESIRE. Version D3.11v0.3 (Draft version 3) (1996) 0.00
```
0.0047577545 = product of:
  0.023788773 = sum of:
    0.023788773 = product of:
      0.047577545 = sum of:
        0.047577545 = weight(_text_:problems in 1669) [ClassicSimilarity], result of:
          0.047577545 = score(doc=1669,freq=6.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.31594402 = fieldWeight in 1669, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.03125 = fieldNorm(doc=1669)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

After a short outline of problems, possibilities and difficulties of systematic information retrieval on the Internet and a description of efforts for development in this area, a specification of the terminology for this report is required. Although the process of retrieval is generally seen as an iterative process of browsing and information retrieval and several important services on the net have taken this fact into consideration, the emphasis of this report lies on the general retrieval tools for the whole of the Internet. In order to be able to evaluate the differences, possibilities and restrictions of the different services it is necessary to begin with organizing the existing varieties in a typological/taxonomical survey. The possibilities and weaknesses will be briefly compared and described for the most important services in the categories robot-based WWW-catalogues of different types, list- or form-based catalogues and simultaneous or collected search services respectively. For various reasons, however, it will not be possible to rank them in order of "best" services. Still more important are the weaknesses and problems common to all attempts at indexing the Internet. The problems of the quality of the input, the technical performance and the general problem of indexing virtual hypertext are shown to be at least as difficult as the different aspects of harvesting, indexing and information retrieval. Some of the attempts made in the area of further development of retrieval services will be mentioned in relation to descriptions of the contents of documents and standardization efforts. Internet harvesting and indexing technology and retrieval software are thoroughly reviewed. Details about all services and software are listed in analytical forms in Annex 1-3.
Sebastiani, F.: Machine learning in automated text categorization (2002) 0.00
```
0.0041203364 = product of:
  0.02060168 = sum of:
    0.02060168 = product of:
      0.04120336 = sum of:
        0.04120336 = weight(_text_:problems in 3389) [ClassicSimilarity], result of:
          0.04120336 = score(doc=3389,freq=2.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.27361554 = fieldWeight in 3389, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.046875 = fieldNorm(doc=3389)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Ko, Y.; Seo, J.: Text classification from unlabeled documents with bootstrapping and feature projection techniques (2009) 0.00
```
0.0041203364 = product of:
  0.02060168 = sum of:
    0.02060168 = product of:
      0.04120336 = sum of:
        0.04120336 = weight(_text_:problems in 2452) [ClassicSimilarity], result of:
          0.04120336 = score(doc=2452,freq=2.0), product of:
            0.15058853 = queryWeight, product of:
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.036484417 = queryNorm
            0.27361554 = fieldWeight in 2452, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1274753 = idf(docFreq=1937, maxDocs=44218)
              0.046875 = fieldNorm(doc=2452)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, the supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then it automatically learns a text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.
Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.00
```
0.003547887 = product of:
  0.017739436 = sum of:
    0.017739436 = product of:
      0.03547887 = sum of:
        0.03547887 = weight(_text_:etc in 1253) [ClassicSimilarity], result of:
          0.03547887 = score(doc=1253,freq=2.0), product of:
            0.19761753 = queryWeight, product of:
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.036484417 = queryNorm
            0.17953302 = fieldWeight in 1253, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.4164915 = idf(docFreq=533, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
      0.5 = coord(1/2)
  0.2 = coord(1/5)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi-structured or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increase, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources and within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to automatically extract the requisite collection metadata that must be distributed.
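The geographic classification mentioned in the abstract above, built from layers of smaller and smaller longitude/latitude boxes, can be pictured with a minimal Python sketch. It is an illustration only and makes no claim about how Pharos actually defines its boxes: the function name `box_path`, the four-way split at each level, and the quadrant labels are assumptions.

```python
def box_path(lat: float, lon: float, depth: int) -> list[str]:
    """Return quadrant labels, coarsest first, of the nested boxes containing a point."""
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    path = []
    for _ in range(depth):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        ns = "N" if lat >= mid_lat else "S"
        ew = "E" if lon >= mid_lon else "W"
        path.append(ns + ew)
        # Shrink the current box to the chosen quadrant.
        if ns == "N":
            south = mid_lat
        else:
            north = mid_lat
        if ew == "E":
            west = mid_lon
        else:
            east = mid_lon
    return path


# Hypothetical sample point, three levels deep.
path = box_path(34.4, -119.7, 3)
print(path)
```

Because each level refines the previous one, a shorter path is always a prefix of a longer path for the same point, which is the property that lets such a scheme act as a hierarchy.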

Bock, H.-H.: Datenanalyse zur Strukturierung und Ordnung von Information (1989) 0.00

0.0034601947 = product of:
  0.017300973 = sum of:
    0.017300973 = product of:
      0.034601945 = sum of:
        0.034601945 = weight(_text_:22 in 141) [ClassicSimilarity], result of:
          0.034601945 = score(doc=141,freq=2.0), product of:
            0.12776221 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036484417 = queryNorm
            0.2708308 = fieldWeight in 141, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=141)
      0.5 = coord(1/2)
  0.2 = coord(1/5)

Pages: pp. 1-22
