Search (32 results, page 1 of 2)

Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.07
```
0.0666666 = product of:
  0.0999999 = sum of:
    0.08328357 = weight(_text_:query in 2765) [ClassicSimilarity], result of:
      0.08328357 = score(doc=2765,freq=4.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.3630963 = fieldWeight in 2765, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.016716326 = product of:
      0.03343265 = sum of:
        0.03343265 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
          0.03343265 = score(doc=2765,freq=2.0), product of:
            0.1728227 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049352113 = queryNorm
            0.19345059 = fieldWeight in 2765, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)
```
Abstract

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.

Date

22. 3.2009 19:14:43

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.07

0.06562922 = product of:
  0.09844383 = sum of:
    0.078384236 = product of:
      0.2351527 = sum of:
        0.2351527 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.2351527 = score(doc=562,freq=2.0), product of:
            0.41840777 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.049352113 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.020059591 = product of:
      0.040119182 = sum of:
        0.040119182 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.040119182 = score(doc=562,freq=2.0), product of:
            0.1728227 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049352113 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.6666667 = coord(2/3)

Content: Vgl.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32

Wu, K.J.; Chen, M.-C.; Sun, Y.: Automatic topics discovery from hyperlinked documents (2004) 0.04
```
0.040800452 = product of:
  0.12240136 = sum of:
    0.12240136 = weight(_text_:query in 2563) [ClassicSimilarity], result of:
      0.12240136 = score(doc=2563,freq=6.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.5336404 = fieldWeight in 2563, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.046875 = fieldNorm(doc=2563)
  0.33333334 = coord(1/3)
```
Abstract

Topic discovery is an important means for marketing, e-Business and social science studies. As well, it can be applied to various purposes, such as identifying a group with certain properties and observing the emergence and diminishment of a certain cyber community. Previous topic discovery work (J.M. Kleinberg, Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, p. 668) requires manual judgment of usefulness of outcomes and is thus incapable of handling the explosive growth of the Internet. In this paper, we propose the Automatic Topic Discovery (ATD) method, which combines a method of base set construction, a clustering algorithm and an iterative principal eigenvector computation method to discover the topics relevant to a given query without using manual examination. Given a query, ATD returns with topics associated with the query and top representative pages for each topic. Our experiments show that the ATD method performs better than the traditional eigenvector method in terms of computation time and topic discovery quality.
Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.04
```
0.039260253 = product of:
  0.11778076 = sum of:
    0.11778076 = weight(_text_:query in 5041) [ClassicSimilarity], result of:
      0.11778076 = score(doc=5041,freq=8.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.5134957 = fieldWeight in 5041, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
  0.33333334 = coord(1/3)
```
Abstract

Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend; social community question answering websites are the second choice for students who try to get answers from other peers online. We attempt discovering possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.
Kwon, O.W.; Lee, J.H.: Text categorization based on k-nearest neighbor approach for web site classification (2003) 0.03
```
0.034722965 = product of:
  0.10416889 = sum of:
    0.10416889 = product of:
      0.20833778 = sum of:
        0.20833778 = weight(_text_:page in 1070) [ClassicSimilarity], result of:
          0.20833778 = score(doc=1070,freq=12.0), product of:
            0.27565226 = queryWeight, product of:
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.049352113 = queryNorm
            0.7557993 = fieldWeight in 1070, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1070)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)
```
Abstract

Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. For Web site classification, this paper proposes the use of Web pages linked with the home page in a different manner from the sole use of home pages in previous research. To implement our proposed method, we derive a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach. It consists of three phases: Web page selection (connectivity analysis), Web page classification, and Web site classification. Given a Web site, the Web page selection chooses several representative Web pages using connectivity analysis. The k-NN classifier next classifies each of the selected Web pages. Finally, the classified Web pages are extended to a classification of the entire Web site. To improve performance, we supplement the k-NN approach with a feature selection method and a term weighting scheme using markup tags, and also reform its document-document similarity measure. In our experiments on a Korean commercial Web directory, the proposed system, using both a home page and its linked pages, improved the performance of micro-averaging breakeven point by 30.02%, compared with an ordinary classification which uses a home page only.

Shen, D.; Chen, Z.; Yang, Q.; Zeng, H.J.; Zhang, B.; Lu, Y.; Ma, W.Y.: Web page classification through summarization (2004) 0.03

0.02835118 = product of:
  0.08505354 = sum of:
    0.08505354 = product of:
      0.17010708 = sum of:
        0.17010708 = weight(_text_:page in 4132) [ClassicSimilarity], result of:
          0.17010708 = score(doc=4132,freq=2.0), product of:
            0.27565226 = queryWeight, product of:
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.049352113 = queryNorm
            0.6171075 = fieldWeight in 4132, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.078125 = fieldNorm(doc=4132)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Huang, Y.-L.: ¬A theoretic and empirical research of cluster indexing for Mandarine Chinese full text document (1998) 0.03
```
0.027482178 = product of:
  0.08244653 = sum of:
    0.08244653 = weight(_text_:query in 513) [ClassicSimilarity], result of:
      0.08244653 = score(doc=513,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.35944697 = fieldWeight in 513, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.0546875 = fieldNorm(doc=513)
  0.33333334 = coord(1/3)
```
Abstract

Since most popular commercialized systems for full text retrieval are designed with full text scaning and Boolean logic query mode, these systems use an oversimplified relationship between the indexing form and the content of document. Reports the use of Singular Value Decomposition (SVD) to develop a Cluster Indexing Model (CIM) based on a Vector Space Model (VSM) in orer to explore the index theory of cluster indexing for chinese full text documents. From a series of experiments, it was found that the indexing performance of CIM is better than traditional VSM, and has almost equivalent effectiveness of the authority control of index terms
Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.03
```
0.026336579 = product of:
  0.079009734 = sum of:
    0.079009734 = weight(_text_:query in 1253) [ClassicSimilarity], result of:
      0.079009734 = score(doc=1253,freq=10.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.34446338 = fieldWeight in 1253, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
  0.33333334 = coord(1/3)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1.000.000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract the requisite collection metadata automatically that must be distributed.
We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.
Larson, R.R.: Experiments in automatic Library of Congress Classification (1992) 0.02
```
0.023556154 = product of:
  0.07066846 = sum of:
    0.07066846 = weight(_text_:query in 1054) [ClassicSimilarity], result of:
      0.07066846 = score(doc=1054,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.30809742 = fieldWeight in 1054, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.046875 = fieldNorm(doc=1054)
  0.33333334 = coord(1/3)
```
Abstract

This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial match retrieval techniques using various elements of new recors (i.e., those to be classified) as "queries", and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 86% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.
Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.02
```
0.023556154 = product of:
  0.07066846 = sum of:
    0.07066846 = weight(_text_:query in 316) [ClassicSimilarity], result of:
      0.07066846 = score(doc=316,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.30809742 = fieldWeight in 316, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.046875 = fieldNorm(doc=316)
  0.33333334 = coord(1/3)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC) [10], within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.02
```
0.023556154 = product of:
  0.07066846 = sum of:
    0.07066846 = weight(_text_:query in 6010) [ClassicSimilarity], result of:
      0.07066846 = score(doc=6010,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.30809742 = fieldWeight in 6010, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.046875 = fieldNorm(doc=6010)
  0.33333334 = coord(1/3)
```
Abstract

Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer-genre classifiers should be reusable across multiple topics-which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
Ozmutlu, S.; Cosar, G.C.: Analyzing the results of automatic new topic identification (2008) 0.02
```
0.023556154 = product of:
  0.07066846 = sum of:
    0.07066846 = weight(_text_:query in 2604) [ClassicSimilarity], result of:
      0.07066846 = score(doc=2604,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.30809742 = fieldWeight in 2604, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.046875 = fieldNorm(doc=2604)
  0.33333334 = coord(1/3)
```
Abstract

Purpose - Identification of topic changes within a user search session is a key issue in content analysis of search engine user queries. Recently, various studies have focused on new topic identification/session identification of search engine transaction logs, and several problems regarding the estimation of topic shifts and continuations were observed in these studies. This study aims to analyze the reasons for the problems that were encountered as a result of applying automatic new topic identification. Design/methodology/approach - Measures, such as cleaning the data of common words and analyzing the errors of automatic new topic identification, are applied to eliminate the problems in estimating topic shifts and continuations. Findings - The findings show that the resulting errors of automatic new topic identification have a pattern, and further research is required to improve the performance of automatic new topic identification. Originality/value - Improving the performance of automatic new topic identification would be valuable to search engine designers, so that they can develop new clustering and query recommendation algorithms, as well as custom-tailored graphical user interfaces for search engine users.
Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.02
```
0.023556154 = product of:
  0.07066846 = sum of:
    0.07066846 = weight(_text_:query in 4948) [ClassicSimilarity], result of:
      0.07066846 = score(doc=4948,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.30809742 = fieldWeight in 4948, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.046875 = fieldNorm(doc=4948)
  0.33333334 = coord(1/3)
```
Abstract

In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
Chung, Y.M.; Lee, J.Y.: ¬A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.02
```
0.019630127 = product of:
  0.05889038 = sum of:
    0.05889038 = weight(_text_:query in 5769) [ClassicSimilarity], result of:
      0.05889038 = score(doc=5769,freq=2.0), product of:
        0.22937049 = queryWeight, product of:
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.049352113 = queryNorm
        0.25674784 = fieldWeight in 5769, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.6476326 = idf(docFreq=1151, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5769)
  0.33333334 = coord(1/3)
```
Abstract

Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of a term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as X**2 statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X**2 statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms
Calado, P.; Cristo, M.; Gonçalves, M.A.; Moura, E.S. de; Ribeiro-Neto, B.; Ziviani, N.: Link-based similarity measures for the classification of Web documents (2006) 0.01
```
0.01417559 = product of:
  0.04252677 = sum of:
    0.04252677 = product of:
      0.08505354 = sum of:
        0.08505354 = weight(_text_:page in 4921) [ClassicSimilarity], result of:
          0.08505354 = score(doc=4921,freq=2.0), product of:
            0.27565226 = queryWeight, product of:
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.049352113 = queryNorm
            0.30855376 = fieldWeight in 4921, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4921)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)
```
Abstract

Traditional text-based document classifiers tend to perform poorly an the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed an a Web directory Show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional textbased classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines an how link structure can be used effectively to classify Web documents.
Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.01
```
0.01417559 = product of:
  0.04252677 = sum of:
    0.04252677 = product of:
      0.08505354 = sum of:
        0.08505354 = weight(_text_:page in 3614) [ClassicSimilarity], result of:
          0.08505354 = score(doc=3614,freq=2.0), product of:
            0.27565226 = queryWeight, product of:
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.049352113 = queryNorm
            0.30855376 = fieldWeight in 3614, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.5854197 = idf(docFreq=450, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)
```
Abstract

Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification systems for browsing. The classification algorithm was evaluated by the users who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing showed to be correlated and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); when searching for class captions, returning the hierarchical tree expanded around the class in which caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.

Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.01

0.013373061 = product of:
  0.040119182 = sum of:
    0.040119182 = product of:
      0.080238365 = sum of:
        0.080238365 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
          0.080238365 = score(doc=1046,freq=2.0), product of:
            0.1728227 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049352113 = queryNorm
            0.46428138 = fieldWeight in 1046, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.09375 = fieldNorm(doc=1046)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 5. 5.2003 14:17:22

Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.01

0.011144217 = product of:
  0.03343265 = sum of:
    0.03343265 = product of:
      0.0668653 = sum of:
        0.0668653 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
          0.0668653 = score(doc=611,freq=2.0), product of:
            0.1728227 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049352113 = queryNorm
            0.38690117 = fieldWeight in 611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=611)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 22. 8.2009 12:54:24

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.01

0.011144217 = product of:
  0.03343265 = sum of:
    0.03343265 = product of:
      0.0668653 = sum of:
        0.0668653 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.0668653 = score(doc=2748,freq=2.0), product of:
            0.1728227 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049352113 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Date: 1. 2.2016 18:25:22

Bock, H.-H.: Datenanalyse zur Strukturierung und Ordnung von Information (1989) 0.01

0.0078009525 = product of:
  0.023402857 = sum of:
    0.023402857 = product of:
      0.046805713 = sum of:
        0.046805713 = weight(_text_:22 in 141) [ClassicSimilarity], result of:
          0.046805713 = score(doc=141,freq=2.0), product of:
            0.1728227 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.049352113 = queryNorm
            0.2708308 = fieldWeight in 141, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=141)
      0.5 = coord(1/2)
  0.33333334 = coord(1/3)

Pages: S.1-22

Search (32 results, page 1 of 2)

Authors

Years

Languages

Types

Themes