Search (70 results, page 1 of 4)

  • year_i:[2000 TO 2010}
  • theme_ss:"Automatisches Klassifizieren"
  1. Reiner, U.: Automatische DDC-Klassifizierung bibliografischer Titeldatensätze der Deutschen Nationalbibliografie (2009) 0.02
    0.022929402 = product of:
      0.045858804 = sum of:
        0.041099545 = weight(_text_:digitale in 3284) [ClassicSimilarity], result of:
          0.041099545 = score(doc=3284,freq=2.0), product of:
            0.18027179 = queryWeight, product of:
              5.158747 = idf(docFreq=690, maxDocs=44218)
              0.034944877 = queryNorm
            0.22798656 = fieldWeight in 3284, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.158747 = idf(docFreq=690, maxDocs=44218)
              0.03125 = fieldNorm(doc=3284)
        0.004759258 = weight(_text_:information in 3284) [ClassicSimilarity], result of:
          0.004759258 = score(doc=3284,freq=2.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.0775819 = fieldWeight in 3284, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.03125 = fieldNorm(doc=3284)
      0.5 = coord(2/4)
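    For orientation, the score trees in this list follow Lucene's ClassicSimilarity (TF-IDF) explain format. A worked reconstruction of the first clause above, using only the numbers shown in the tree (the grouping into queryWeight and fieldWeight mirrors the explain output):

    ```latex
    \begin{align*}
    \mathrm{idf}(t) &= 1 + \ln\frac{N}{n_t+1} = 1 + \ln\frac{44218}{690+1} \approx 5.1587 \\
    \mathrm{queryWeight}(t) &= \mathrm{idf}(t)\cdot\mathrm{queryNorm} = 5.1587 \cdot 0.0349449 \approx 0.18027 \\
    \mathrm{fieldWeight}(t,d) &= \sqrt{f_{t,d}}\cdot\mathrm{idf}(t)\cdot\mathrm{fieldNorm}(d) = \sqrt{2}\cdot 5.1587\cdot 0.03125 \approx 0.22799 \\
    \mathrm{score}(q,d) &= \mathrm{coord}(q,d)\cdot\sum_{t\in q}\mathrm{queryWeight}(t)\cdot\mathrm{fieldWeight}(t,d) = 0.5\cdot(0.041100 + 0.004759) \approx 0.022929
    \end{align*}
    ```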
    
    Abstract
    At the latest since the advent of the World Wide Web, the number of publications to be classified has been growing faster than they can be subject-indexed intellectually. Methods are therefore being sought to automate the classification of text objects, or at least to support intellectual classification. Procedures for automatic document classification have existed since 1968 (information retrieval, IR) and for automatic text classification since 1992 (ATC: Automated Text Categorization). As ever more digital objects have become available on the World Wide Web, work on automatic text classification has increased markedly since about 1998. Since 1996 this has included work on the automatic DDC and RVK classification of bibliographic title records and full-text documents. To our knowledge, these developments have so far been experimental systems rather than systems in continuous production use. The VZG project Colibri/DDC has also, among other things, been concerned with automatic DDC classification since 2006. The related investigations and developments serve to answer the research question: "Is it possible to automatically achieve a subject-consistent DDC title classification of all GVK-PLUS title records?"
  2. Miyamoto, S.: Information clustering based on fuzzy multisets (2003) 0.01
    0.0051002675 = product of:
      0.02040107 = sum of:
        0.02040107 = weight(_text_:information in 1071) [ClassicSimilarity], result of:
          0.02040107 = score(doc=1071,freq=12.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.3325631 = fieldWeight in 1071, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1071)
      0.25 = coord(1/4)
    
    Abstract
    A fuzzy multiset model for information clustering is proposed, with application to information retrieval on the World Wide Web. Noting that a search engine retrieves multiple occurrences of the same subjects with possibly different degrees of relevance, we observe that fuzzy multisets provide an appropriate model of information retrieval on the WWW. Information clustering, which covers both term clustering and document clustering, is considered. Three methods are proposed: hard c-means, fuzzy c-means, and an agglomerative method using cluster centers. Two distances between fuzzy multisets and algorithms for calculating cluster centers are defined. Theoretical properties concerning the clustering algorithms are studied. Illustrative examples are given to show how the algorithms work.
    Source
    Information processing and management. 39(2003) no.2, S.195-213
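    The hard and fuzzy c-means methods named in the abstract above are standard; below is a minimal NumPy sketch of the fuzzy c-means updates on ordinary term vectors (an assumption: Miyamoto's variant operates on fuzzy multisets with the specialised distances defined in the paper, which are not reproduced here).

    ```python
    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, iters=100, tol=1e-5, seed=0):
        """Standard fuzzy c-means. X: (n_samples, n_features); c: clusters;
        m > 1 is the fuzzifier. Returns (memberships U, cluster centers)."""
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))      # rows sum to 1
        for _ in range(iters):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            # Squared distances to each center; epsilon avoids division by zero.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
            inv = d2 ** (-1.0 / (m - 1))                # u_ik ∝ d_ik^(-2/(m-1))
            U_new = inv / inv.sum(axis=1, keepdims=True)
            if np.abs(U_new - U).max() < tol:
                return U_new, centers
            U = U_new
        return U, centers

    # Toy usage: two fuzzy clusters over four 2-D "document" vectors.
    X = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 0.9]])
    U, centers = fuzzy_c_means(X, c=2)
    ```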
  3. Wu, M.; Fuller, M.; Wilkinson, R.: Using clustering and classification approaches in interactive retrieval (2001) 0.00
    0.004164351 = product of:
      0.016657405 = sum of:
        0.016657405 = weight(_text_:information in 2666) [ClassicSimilarity], result of:
          0.016657405 = score(doc=2666,freq=2.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.27153665 = fieldWeight in 2666, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.109375 = fieldNorm(doc=2666)
      0.25 = coord(1/4)
    
    Source
    Information processing and management. 37(2001) no.3, S.459-484
  4. Guerrero-Bote, V.P.; Moya Anegón, F. de; Herrero Solana, V.: Document organization using Kohonen's algorithm (2002) 0.00
    0.004121639 = product of:
      0.016486555 = sum of:
        0.016486555 = weight(_text_:information in 2564) [ClassicSimilarity], result of:
          0.016486555 = score(doc=2564,freq=6.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.2687516 = fieldWeight in 2564, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0625 = fieldNorm(doc=2564)
      0.25 = coord(1/4)
    
    Abstract
    The classification of documents from a bibliographic database is a task that is linked to processes of information retrieval based on partial matching. A method is described for vectorizing reference documents from LISA which permits their topological organization using Kohonen's algorithm. As an example, a map is generated of 202 documents from LISA, and an analysis is made of the possibilities of this type of neural network with respect to the development of information retrieval systems based on graphical browsing.
    Source
    Information processing and management. 38(2002) no.1, S.79-89
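    Kohonen's algorithm (a self-organizing map), used above to lay documents out on a 2-D grid for browsing, fits in a short sketch; this is a generic SOM with a Gaussian neighbourhood, not the paper's LISA-specific vectorization.

    ```python
    import numpy as np

    def train_som(X, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
        """Train a 2-D Kohonen map; returns weights (rows, cols, n_features)."""
        rng = np.random.default_rng(seed)
        rows, cols = grid
        W = rng.normal(size=(rows, cols, X.shape[1]))
        # Grid coordinates, used by the neighbourhood function.
        coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                      indexing="ij"), axis=-1).astype(float)
        for t in range(iters):
            x = X[rng.integers(len(X))]
            # Best-matching unit: the node whose weights are closest to x.
            bmu = np.unravel_index(((W - x) ** 2).sum(-1).argmin(), (rows, cols))
            frac = t / iters                       # decay schedules
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            h = np.exp(-((coords - np.array(bmu)) ** 2).sum(-1) / (2 * sigma**2))
            W += lr * h[..., None] * (x - W)       # pull BMU's neighbourhood toward x
        return W
    ```

    Documents mapped to nearby grid nodes end up topologically close, which is what enables the graphical browsing the abstract discusses.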
  5. Mukhopadhyay, S.; Peng, S.; Raje, R.; Palakal, M.; Mostafa, J.: Multi-agent information classification using dynamic acquaintance lists (2003) 0.00
    0.0039907596 = product of:
      0.015963038 = sum of:
        0.015963038 = weight(_text_:information in 1755) [ClassicSimilarity], result of:
          0.015963038 = score(doc=1755,freq=10.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.2602176 = fieldWeight in 1755, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1755)
      0.25 = coord(1/4)
    
    Abstract
    There has been considerable interest in recent years in providing automated information services, such as information classification, by means of a society of collaborative agents. These agents augment each other's knowledge structures (e.g., the vocabularies) and assist each other in providing efficient information services to a human user. However, when the number of agents present in the society increases, exhaustive communication and collaboration among agents result in a large communication overhead and increased delays in response time. This paper introduces a method to achieve selective interaction with a relatively small number of potentially useful agents, based on simple agent modeling and acquaintance lists. The key idea presented here is that the acquaintance list of an agent, representing a small number of other agents to be collaborated with, is dynamically adjusted. The best acquaintances are automatically discovered using a learning algorithm, based on the past history of collaboration. Experimental results are presented to demonstrate that such dynamically learned acquaintance lists can lead to high quality of classification, while significantly reducing the delay in response time.
    Source
    Journal of the American Society for Information Science and Technology. 54(2003) no.10, S.966-975
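    The dynamically adjusted acquaintance list described above reduces to keeping a running usefulness score per peer agent; a minimal sketch follows (the class name and the exponential-moving-average update are illustrative assumptions, not the paper's exact learning rule).

    ```python
    from collections import defaultdict

    class AcquaintanceList:
        """Track the k agents whose past collaborations were most useful."""
        def __init__(self, k=5, alpha=0.3):
            self.k, self.alpha = k, alpha
            self.score = defaultdict(float)          # agent id -> usefulness

        def record(self, agent, usefulness):
            # Exponential moving average over the collaboration history.
            self.score[agent] = ((1 - self.alpha) * self.score[agent]
                                 + self.alpha * usefulness)

        def acquaintances(self):
            # Only the current top-k agents are contacted, instead of all.
            return sorted(self.score, key=self.score.get, reverse=True)[:self.k]
    ```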
  6. Denoyer, L.; Gallinari, P.: Bayesian network model for semi-structured document classification (2004) 0.00
    0.0039907596 = product of:
      0.015963038 = sum of:
        0.015963038 = weight(_text_:information in 995) [ClassicSimilarity], result of:
          0.015963038 = score(doc=995,freq=10.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.2602176 = fieldWeight in 995, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=995)
      0.25 = coord(1/4)
    
    Abstract
    Recently, a new community has started to emerge around the development of new information research methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that is able to handle both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here, text and images). The model was tested on three databases: the classical WebKB corpus composed of HTML pages, the new INEX corpus, which has become a reference in the field of ad hoc retrieval for XML documents, and a multimedia corpus of Web pages.
    Source
    Information processing and management. 40(2004) no.5, S.807-827
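    The Fisher kernel step mentioned above is the standard construction for turning a generative model into features for a discriminant classifier; stated generically (for the paper's model, theta denotes the Bayesian-network parameters):

    ```latex
    U_x = \nabla_{\theta}\, \log P(x \mid \theta), \qquad
    K(x, y) = U_x^{\top} I^{-1} U_y, \qquad
    I = \mathbb{E}_x\!\left[ U_x U_x^{\top} \right]
    ```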
  7. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: ¬A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 0.00
    0.0039907596 = product of:
      0.015963038 = sum of:
        0.015963038 = weight(_text_:information in 1998) [ClassicSimilarity], result of:
          0.015963038 = score(doc=1998,freq=10.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.2602176 = fieldWeight in 1998, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1998)
      0.25 = coord(1/4)
    
    Abstract
    Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to the current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did.
    Source
    Journal of the American Society for Information Science and Technology. 59(2008) no.9, S.1409-1419
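    As a contrast to the vocabulary-based classifier, a readability formula uses only surface counts. This record does not name the formulas used; the Flesch-Kincaid grade level is shown as a representative example (syllable counting is approximated by vowel groups).

    ```python
    import re

    def flesch_kincaid_grade(text):
        """Grade level = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        n_words = max(1, len(words))
        # Approximate syllables as runs of vowels, at least one per word.
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                        for w in words)
        return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59
    ```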
  8. Major, R.L.; Ragsdale, C.T.: ¬An aggregation approach to the classification problem using multiple prediction experts (2000) 0.00
    0.0035694437 = product of:
      0.014277775 = sum of:
        0.014277775 = weight(_text_:information in 3789) [ClassicSimilarity], result of:
          0.014277775 = score(doc=3789,freq=2.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.23274569 = fieldWeight in 3789, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.09375 = fieldNorm(doc=3789)
      0.25 = coord(1/4)
    
    Source
    Information processing and management. 36(2000) no.4, S.683-696
  9. Yu, W.; Gong, Y.: Document clustering by concept factorization (2004) 0.00
    0.0035694437 = product of:
      0.014277775 = sum of:
        0.014277775 = weight(_text_:information in 4084) [ClassicSimilarity], result of:
          0.014277775 = score(doc=4084,freq=2.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.23274569 = fieldWeight in 4084, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.09375 = fieldNorm(doc=4084)
      0.25 = coord(1/4)
    
    Source
    SIGIR'04: Proceedings of the 27th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Ed.: K. Järvelin et al.
  10. Yi, K.: Challenges in automated classification using library classification schemes (2006) 0.00
    0.0033653039 = product of:
      0.013461215 = sum of:
        0.013461215 = weight(_text_:information in 5810) [ClassicSimilarity], result of:
          0.013461215 = score(doc=5810,freq=4.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.21943474 = fieldWeight in 5810, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0625 = fieldNorm(doc=5810)
      0.25 = coord(1/4)
    
    Abstract
    A major library classification scheme has long been the standard classification framework for information sources in the traditional library environment, while text classification (TC) has become a popular and attractive tool for organizing digital information. This paper gives an overview of previous projects and studies on TC using major library classification schemes, and summarizes a discussion of TC research challenges.
  11. Yoon, Y.; Lee, G.G.: Efficient implementation of associative classifiers for document classification (2007) 0.00
    0.003091229 = product of:
      0.012364916 = sum of:
        0.012364916 = weight(_text_:information in 909) [ClassicSimilarity], result of:
          0.012364916 = score(doc=909,freq=6.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.20156369 = fieldWeight in 909, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=909)
      0.25 = coord(1/4)
    
    Abstract
    In practical text classification tasks, the ability to interpret the classification result is as important as the ability to classify exactly. Associative classifiers have many favorable characteristics, such as rapid training, good classification accuracy, and excellent interpretability. However, associative classifiers also have some obstacles to overcome when they are applied in the area of text classification. The target text collection generally has a very high dimension, so the training process might take a very long time. We propose a feature selection based on the mutual information between the word and class variables to reduce the space dimension of the associative classifiers. In addition, the training process of the associative classifier produces a huge number of classification rules, which makes prediction on a new document inefficient. We resolve this by introducing a new efficient method for storing and pruning classification rules. This method can also be used when predicting a test document. Experimental results using the 20-newsgroups dataset show many benefits of associative classification in both training and predicting when applied to a real-world problem.
    Footnote
    In: Special issue on AIRS2005: Information Retrieval Research in Asia
    Source
    Information processing and management. 43(2007) no.2, S.393-405
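    The mutual-information feature selection proposed above scores each word by how much its presence reveals about the class variable; a minimal sketch over binary word occurrence (a standard MI estimate from document counts, assumed rather than taken verbatim from the paper):

    ```python
    import math
    from collections import Counter

    def mutual_information(docs, labels, word):
        """I(word presence; class), estimated from counts; docs are sets of words."""
        n = len(docs)
        joint = Counter((word in doc, lab) for doc, lab in zip(docs, labels))
        p_w = Counter(word in doc for doc in docs)
        p_c = Counter(labels)
        return sum((k / n) * math.log(k * n / (p_w[w] * p_c[c]))
                   for (w, c), k in joint.items())

    docs = [{"win", "game"}, {"stock", "price"}, {"game", "score"}, {"price", "rise"}]
    labels = ["sport", "finance", "sport", "finance"]
    vocab = {w for d in docs for w in d}
    # Keep only the highest-MI words as features for the associative classifier.
    features = sorted(vocab, key=lambda w: -mutual_information(docs, labels, w))[:2]
    ```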
  12. Montesi, M.; Navarrete, T.: Classifying web genres in context : A case study documenting the web genres used by a software engineer (2008) 0.00
    0.003091229 = product of:
      0.012364916 = sum of:
        0.012364916 = weight(_text_:information in 2100) [ClassicSimilarity], result of:
          0.012364916 = score(doc=2100,freq=6.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.20156369 = fieldWeight in 2100, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2100)
      0.25 = coord(1/4)
    
    Abstract
    This case study analyzes the Internet-based resources that a software engineer uses in his daily work. Methodologically, we studied the web browser history of the participant, classifying all the web pages he had seen over a period of 12 days into web genres. We interviewed him before and after the analysis of the web browser history. In the first interview, he spoke about his general information behavior; in the second, he commented on each web genre, explaining why and how he used them. As a result, three approaches allow us to describe the set of 23 web genres obtained: (a) the purposes they serve for the participant; (b) the role they play in the various work and search phases; (c) and the way they are used in combination with each other. Further observations concern the way the participant assesses quality of web-based resources, and his information behavior as a software engineer.
    Source
    Information processing and management. 44(2008) no.4, S.1410-1430
  13. Choi, B.; Peng, X.: Dynamic and hierarchical classification of Web pages (2004) 0.00
    0.003091229 = product of:
      0.012364916 = sum of:
        0.012364916 = weight(_text_:information in 2555) [ClassicSimilarity], result of:
          0.012364916 = score(doc=2555,freq=6.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.20156369 = fieldWeight in 2555, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2555)
      0.25 = coord(1/4)
    
    Abstract
    Automatic classification of Web pages is an effective way to organise the vast amount of information and to assist in retrieving relevant information from the Internet. Although many automatic classification systems have been proposed, most of them ignore the conflict between the fixed number of categories and the growing number of Web pages being added into the systems. They also require searching through all existing categories to make any classification. This article proposes a dynamic and hierarchical classification system that is capable of adding new categories as required, organising the Web pages into a tree structure, and classifying Web pages by searching through only one path of the tree. The proposed single-path search technique reduces the search complexity from O(n) to O(log(n)). Test results show that the system improves the accuracy of classification by 6 percent in comparison to related systems. The dynamic-category expansion technique also achieves satisfying results for adding new categories into the system as required.
    Source
    Online information review. 28(2004) no.2, S.139-147
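    The single-path search described above descends the category tree greedily, scoring a page against only the children of the current node; a minimal sketch (cosine similarity against per-category centroid vectors is an assumption; the paper's node-level classifier may differ).

    ```python
    import math

    def cosine(a, b):
        dot = sum(v * b.get(k, 0.0) for k, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def classify_single_path(page_vec, node):
        """Follow one root-to-leaf path: O(branching * depth), i.e. O(log n)
        comparisons, instead of scoring all n categories."""
        path = [node["name"]]
        while node["children"]:
            node = max(node["children"],
                       key=lambda ch: cosine(page_vec, ch["centroid"]))
            path.append(node["name"])
        return path

    tree = {"name": "root", "children": [
        {"name": "sports", "centroid": {"game": 1.0, "score": 0.8}, "children": []},
        {"name": "finance", "centroid": {"stock": 1.0, "price": 0.9}, "children": []},
    ]}
    assert classify_single_path({"stock": 1.0, "rise": 0.5}, tree) == ["root", "finance"]
    ```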
  14. Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.00
    0.003091229 = product of:
      0.012364916 = sum of:
        0.012364916 = weight(_text_:information in 2760) [ClassicSimilarity], result of:
          0.012364916 = score(doc=2760,freq=6.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.20156369 = fieldWeight in 2760, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2760)
      0.25 = coord(1/4)
    
    Abstract
    Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.4, S.803-813
  15. Li, T.; Zhu, S.; Ogihara, M.: Hierarchical document classification using automatically generated hierarchy (2007) 0.00
    0.003091229 = product of:
      0.012364916 = sum of:
        0.012364916 = weight(_text_:information in 4797) [ClassicSimilarity], result of:
          0.012364916 = score(doc=4797,freq=6.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.20156369 = fieldWeight in 4797, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=4797)
      0.25 = coord(1/4)
    
    Abstract
    Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing need for organization. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using the generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
    Source
    Journal of intelligent information systems. 29(2007) no.2, S.211-230
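    A compact sketch of the pipeline described above, with scikit-learn and SciPy standing in for the authors' implementation (assumption: Ward clustering of per-class centroids in the discriminant space plays the role of their category-clustering step).

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def build_topic_hierarchy(X, y):
        """Project documents with linear discriminant analysis, then cluster
        the per-class centroids agglomeratively into intermediate levels."""
        y = np.asarray(y)
        Z = LinearDiscriminantAnalysis().fit_transform(X, y)
        classes = np.unique(y)
        centroids = np.array([Z[y == c].mean(axis=0) for c in classes])
        # The linkage matrix encodes the generated topic hierarchy.
        return linkage(centroids, method="ward"), classes
    ```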
  16. Chung, Y.M.; Lee, J.Y.: ¬A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.00
    0.0029745363 = product of:
      0.011898145 = sum of:
        0.011898145 = weight(_text_:information in 5769) [ClassicSimilarity], result of:
          0.011898145 = score(doc=5769,freq=8.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.19395474 = fieldWeight in 5769, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.25 = coord(1/4)
    
    Abstract
    Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-scores. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas the cosine and Jaccard coefficients, as well as the X² statistic and the likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X² statistic is the least affected by the frequency of terms. Third, although the cosine and Jaccard coefficients tend to emphasize high-frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
    Source
    Journal of the American Society for Information Science and Technology. 52(2001) no.4, S.283-296
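    For reference, common textbook definitions of the six measures compared above, over a 2x2 term co-occurrence table with cells a (both terms), b and c (one term only), d (neither), and N = a+b+c+d; the paper's exact variants may differ in detail.

    ```latex
    \begin{aligned}
    \cos &= \frac{a}{\sqrt{(a+b)(a+c)}}, &
    \mathrm{Jaccard} &= \frac{a}{a+b+c}, &
    \mathrm{MI} &= \log\frac{aN}{(a+b)(a+c)}, \\[4pt]
    Y &= \frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}, &
    \chi^2 &= \frac{N(ad-bc)^2}{(a+b)(c+d)(a+c)(b+d)}, &
    G^2 &= 2\sum_i O_i \ln\frac{O_i}{E_i}.
    \end{aligned}
    ```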
  17. Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.00
    0.0029745363 = product of:
      0.011898145 = sum of:
        0.011898145 = weight(_text_:information in 611) [ClassicSimilarity], result of:
          0.011898145 = score(doc=611,freq=2.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.19395474 = fieldWeight in 611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.078125 = fieldNorm(doc=611)
      0.25 = coord(1/4)
    
    Content
    Presentation slides for the talk given at the 98th Deutscher Bibliothekartag in Erfurt ("Ein neuer Blick auf Bibliotheken"); session TK10: "Information erschließen und recherchieren - Inhalte erschließen mit neuen Tools"
  18. Shen, D.; Chen, Z.; Yang, Q.; Zeng, H.J.; Zhang, B.; Lu, Y.; Ma, W.Y.: Web page classification through summarization (2004) 0.00
    0.0029745363 = product of:
      0.011898145 = sum of:
        0.011898145 = weight(_text_:information in 4132) [ClassicSimilarity], result of:
          0.011898145 = score(doc=4132,freq=2.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.19395474 = fieldWeight in 4132, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.078125 = fieldNorm(doc=4132)
      0.25 = coord(1/4)
    
    Source
    SIGIR'04: Proceedings of the 27th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Ed.: K. Järvelin et al.
  19. Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.00
    0.0029745363 = product of:
      0.011898145 = sum of:
        0.011898145 = weight(_text_:information in 4921) [ClassicSimilarity], result of:
          0.011898145 = score(doc=4921,freq=8.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.19395474 = fieldWeight in 4921, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4921)
      0.25 = coord(1/4)
    
    Abstract
    Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
    Source
    Journal of the American Society for Information Science and Technology. 57(2006) no.2, S.208-221
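    The five link-based measures are not listed in this record, but the family the paper draws on is easy to illustrate; a sketch of two classic ones (co-citation over in-links, bibliographic coupling over out-links), with Jaccard normalisation as an assumed choice.

    ```python
    def jaccard(s, t):
        return len(s & t) / len(s | t) if s | t else 0.0

    def co_citation(inlinks, p, q):
        # Two pages are similar if the same pages link to both of them.
        return jaccard(inlinks[p], inlinks[q])

    def bibliographic_coupling(outlinks, p, q):
        # Two pages are similar if they link out to the same pages.
        return jaccard(outlinks[p], outlinks[q])

    inlinks = {"a": {"x", "y"}, "b": {"x", "y", "z"}}
    assert round(co_citation(inlinks, "a", "b"), 2) == 0.67
    ```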
  20. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.00
    0.0029745363 = product of:
      0.011898145 = sum of:
        0.011898145 = weight(_text_:information in 2765) [ClassicSimilarity], result of:
          0.011898145 = score(doc=2765,freq=8.0), product of:
            0.06134496 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.034944877 = queryNorm
            0.19395474 = fieldWeight in 2765, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
      0.25 = coord(1/4)
    
    Abstract
    Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.4, S.814-825
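    Of the document-splitting baselines the abstract compares against, the simplest is a fixed-length sliding window; a minimal sketch of splitting plus per-passage classification (the classifier is any callable; KDP's keyword-anchored dynamic passages are not reproduced here).

    ```python
    def sliding_passages(words, size=50, stride=25):
        """Fixed-length overlapping passages, one baseline splitting approach."""
        return [words[i:i + size]
                for i in range(0, max(1, len(words) - size + 1), stride)]

    def detect_hidden_passages(words, classify, expected_category):
        """Flag passages whose predicted category differs from the document's."""
        return [(i, cat) for i, p in enumerate(sliding_passages(words))
                if (cat := classify(p)) != expected_category]
    ```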

Languages

  • e 62
  • d 7
  • a 1

Types

  • a 62
  • el 6
  • m 2
  • x 2
  • s 1