Search (176 results, page 1 of 9)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.39

0.38848594 = product of:
  0.64747655 = sum of:
    0.048844352 = product of:
      0.14653306 = sum of:
        0.14653306 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.14653306 = score(doc=562,freq=2.0), product of:
            0.2607266 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.030753274 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.14653306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14653306 = score(doc=562,freq=2.0), product of:
        0.2607266 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.030753274 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.14653306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14653306 = score(doc=562,freq=2.0), product of:
        0.2607266 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.030753274 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.14653306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14653306 = score(doc=562,freq=2.0), product of:
        0.2607266 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.030753274 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.14653306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.14653306 = score(doc=562,freq=2.0), product of:
        0.2607266 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.030753274 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.012499932 = product of:
      0.024999864 = sum of:
        0.024999864 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.024999864 = score(doc=562,freq=2.0), product of:
            0.107692726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.030753274 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.6 = coord(6/10)

Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32
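
The per-term numbers in the score explanation for this result follow the TF-IDF formulas of Lucene's ClassicSimilarity: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf * queryNorm, and fieldWeight = tf * idf * fieldNorm. The Python sketch below recomputes the `_text_:3a` clause for doc 562 from the values printed in the listing. The function names are illustrative and not part of any Lucene API, and the last digits can differ slightly because Lucene computes in 32-bit floats.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one term clause: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


# Inputs copied from the explanation of `_text_:3a` in doc 562.
idf_3a = classic_idf(doc_freq=24, max_docs=44218)
score_3a = term_score(
    freq=2.0,
    doc_freq=24,
    max_docs=44218,
    query_norm=0.030753274,
    field_norm=0.046875,
)
print(f"idf={idf_3a:.6f} score={score_3a:.8f}")
```

The printed values should agree with the listing (idf 8.478011, score 0.14653306) up to rounding in the final digits.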

Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.03

0.030940691 = product of:
  0.10313563 = sum of:
    0.0062825847 = weight(_text_:information in 690) [ClassicSimilarity], result of:
      0.0062825847 = score(doc=690,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.116372846 = fieldWeight in 690, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.08435312 = weight(_text_:ranking in 690) [ClassicSimilarity], result of:
      0.08435312 = score(doc=690,freq=4.0), product of:
        0.16634533 = queryWeight, product of:
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.030753274 = queryNorm
        0.5070964 = fieldWeight in 690, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.012499932 = product of:
      0.024999864 = sum of:
        0.024999864 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
          0.024999864 = score(doc=690,freq=2.0), product of:
            0.107692726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.030753274 = queryNorm
            0.23214069 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.5 = coord(1/2)
  0.3 = coord(3/10)

Abstract: We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between latent semantic indexing (LSI) term subspace and LSI document subspace. LSISSM does feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
Date: 23. 3.2013 13:22:36
Source: Journal of the American Society for Information Science and Technology. 64(2013) no.4, pp. 844-860
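
The top-level score of this result is the sum of the matching clause scores multiplied by the coordination factor coord(3/10). The `22` clause belongs to a nested query and is first scaled by its own coord(1/2). The Python sketch below replays that arithmetic with the values from the explanation for doc 690; the helper name `coord_score` is illustrative only.

```python
def coord_score(clause_scores, matched, total):
    """Sum the clause scores and scale by the fraction of matched clauses."""
    return sum(clause_scores) * (matched / total)


# Clause scores copied from the explanation for doc 690.
information = 0.0062825847
ranking = 0.08435312
# The nested query matched 1 of its 2 clauses, hence coord(1/2).
nested_22 = coord_score([0.024999864], matched=1, total=2)

clause_sum = information + ranking + nested_22
total_score = coord_score([information, ranking, nested_22], matched=3, total=10)
print(f"sum={clause_sum:.8f} score={total_score:.9f}")
```

Up to rounding, the result should match the 0.10313563 and 0.030940691 shown at the top of the explanation tree.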

Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.03
```
0.027973032 = product of:
  0.093243435 = sum of:
    0.007404097 = weight(_text_:information in 5041) [ClassicSimilarity], result of:
      0.007404097 = score(doc=5041,freq=4.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.13714671 = fieldWeight in 5041, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
    0.015545071 = weight(_text_:retrieval in 5041) [ClassicSimilarity], result of:
      0.015545071 = score(doc=5041,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.16710453 = fieldWeight in 5041, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
    0.07029427 = weight(_text_:ranking in 5041) [ClassicSimilarity], result of:
      0.07029427 = score(doc=5041,freq=4.0), product of:
        0.16634533 = queryWeight, product of:
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.030753274 = queryNorm
        0.42258036 = fieldWeight in 5041, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
  0.3 = coord(3/10)
```
Abstract: Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend: social community question answering websites are the second choice for students who try to get answers from other peers online. We attempt to discover possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.
Source: Information processing and management. 56(2019) no.1, pp. 228-246

Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.02

0.018718302 = product of:
  0.062394336 = sum of:
    0.010470974 = weight(_text_:information in 611) [ClassicSimilarity], result of:
      0.010470974 = score(doc=611,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.19395474 = fieldWeight in 611, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.078125 = fieldNorm(doc=611)
    0.031090142 = weight(_text_:retrieval in 611) [ClassicSimilarity], result of:
      0.031090142 = score(doc=611,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.33420905 = fieldWeight in 611, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.078125 = fieldNorm(doc=611)
    0.02083322 = product of:
      0.04166644 = sum of:
        0.04166644 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
          0.04166644 = score(doc=611,freq=2.0), product of:
            0.107692726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.030753274 = queryNorm
            0.38690117 = fieldWeight in 611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=611)
      0.5 = coord(1/2)
  0.3 = coord(3/10)

Content: Presentation for the talk given at the 98th Deutscher Bibliothekartag in Erfurt: Ein neuer Blick auf Bibliotheken; TK10: Information erschließen und recherchieren Inhalte erschließen - mit neuen Tools
Date: 22. 8.2009 12:54:24
Theme: Classification systems in online retrieval
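
The explanation trees in this listing encode their arithmetic in the indentation: a node ending in "product of:" or "sum of:" should equal the product or sum of the nodes indented one level below it. The Python sketch below parses one fragment (the `22` clause of doc 611, dedented) and checks every such node. It assumes two spaces per nesting level, as in this listing, and a relative tolerance of 1e-5 to absorb 32-bit float rounding; the function names are illustrative.

```python
import math

EXPLAIN = """\
0.02083322 = product of:
  0.04166644 = sum of:
    0.04166644 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
      0.04166644 = score(doc=611,freq=2.0), product of:
        0.107692726 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.030753274 = queryNorm
        0.38690117 = fieldWeight in 611, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.078125 = fieldNorm(doc=611)
  0.5 = coord(1/2)
"""


def parse(text):
    """Return the root nodes; each node is a dict with value, desc, children."""
    roots, stack = [], []  # stack holds (depth, node) for the current path
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // 2
        value, desc = line.strip().split(" = ", 1)
        node = {"value": float(value), "desc": desc, "children": []}
        while stack and stack[-1][0] >= depth:
            stack.pop()
        (stack[-1][1]["children"] if stack else roots).append(node)
        stack.append((depth, node))
    return roots


def check(node, rel_tol=1e-5):
    """Return descriptions of nodes whose value disagrees with their children."""
    bad = []
    kids = [child["value"] for child in node["children"]]
    if node["desc"].endswith("product of:"):
        expected = math.prod(kids)
    elif node["desc"].endswith("sum of:"):
        expected = sum(kids)
    elif node["desc"].endswith("result of:"):
        expected = kids[0]
    else:
        expected = node["value"]  # leaf or "with freq of:", nothing to verify
    if not math.isclose(node["value"], expected, rel_tol=rel_tol):
        bad.append(node["desc"])
    for child in node["children"]:
        bad.extend(check(child, rel_tol))
    return bad


tree = parse(EXPLAIN)
problems = [desc for root in tree for desc in check(root)]
print(problems)
```

An empty list means every product and sum in the fragment is internally consistent; the same check can be run on any other explanation in the listing after removing its leading indentation.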

Golub, K.; Soergel, D.; Buchanan, G.; Tudhope, D.; Lykke, M.; Hiom, D.: A framework for evaluating automatic indexing or classification in the context of retrieval (2016) 0.02

0.018564304 = product of:
  0.061881013 = sum of:
    0.0090681305 = weight(_text_:information in 3311) [ClassicSimilarity], result of:
      0.0090681305 = score(doc=3311,freq=6.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.16796975 = fieldWeight in 3311, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
    0.026924854 = weight(_text_:retrieval in 3311) [ClassicSimilarity], result of:
      0.026924854 = score(doc=3311,freq=6.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.28943354 = fieldWeight in 3311, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
    0.02588803 = product of:
      0.05177606 = sum of:
        0.05177606 = weight(_text_:evaluation in 3311) [ClassicSimilarity], result of:
          0.05177606 = score(doc=3311,freq=6.0), product of:
            0.12900078 = queryWeight, product of:
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.030753274 = queryNorm
            0.40136236 = fieldWeight in 3311, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3311)
      0.5 = coord(1/2)
  0.3 = coord(3/10)

Abstract: Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. Although some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The article reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.
Series: Advances in information science
Source: Journal of the Association for Information Science and Technology. 67(2016) no.1, pp. 3-16

Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.02
```
0.016694225 = product of:
  0.055647418 = sum of:
    0.010470974 = weight(_text_:information in 2765) [ClassicSimilarity], result of:
      0.010470974 = score(doc=2765,freq=8.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.19395474 = fieldWeight in 2765, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.034759834 = weight(_text_:retrieval in 2765) [ClassicSimilarity], result of:
      0.034759834 = score(doc=2765,freq=10.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.37365708 = fieldWeight in 2765, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.01041661 = product of:
      0.02083322 = sum of:
        0.02083322 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
          0.02083322 = score(doc=2765,freq=2.0), product of:
            0.107692726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.030753274 = queryNorm
            0.19345059 = fieldWeight in 2765, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
  0.3 = coord(3/10)
```
Abstract: Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.
Date: 22. 3.2009 19:14:43
Source: Journal of the American Society for Information Science and Technology. 60(2009) no.4, pp. 814-825

Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.02
```
0.015731294 = product of:
  0.05243764 = sum of:
    0.009423877 = weight(_text_:information in 1253) [ClassicSimilarity], result of:
      0.009423877 = score(doc=1253,freq=18.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.17455927 = fieldWeight in 1253, product of:
          4.2426405 = tf(freq=18.0), with freq of:
            18.0 = termFreq=18.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
    0.013190431 = weight(_text_:retrieval in 1253) [ClassicSimilarity], result of:
      0.013190431 = score(doc=1253,freq=4.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.1417929 = fieldWeight in 1253, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
    0.029823331 = weight(_text_:ranking in 1253) [ClassicSimilarity], result of:
      0.029823331 = score(doc=1253,freq=2.0), product of:
        0.16634533 = queryWeight, product of:
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.030753274 = queryNorm
        0.17928566 = fieldWeight in 1253, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
  0.3 = coord(3/10)
```
Abstract: Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1,000,000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source.
That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract the requisite collection metadata automatically that must be distributed.
We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a 'training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices.
The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.

Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.02

0.015669255 = product of:
  0.078346275 = sum of:
    0.0073296824 = weight(_text_:information in 5273) [ClassicSimilarity], result of:
      0.0073296824 = score(doc=5273,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.13576832 = fieldWeight in 5273, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
    0.071016595 = sum of:
      0.04185009 = weight(_text_:evaluation in 5273) [ClassicSimilarity], result of:
        0.04185009 = score(doc=5273,freq=2.0), product of:
          0.12900078 = queryWeight, product of:
            4.1947007 = idf(docFreq=1811, maxDocs=44218)
            0.030753274 = queryNorm
          0.32441732 = fieldWeight in 5273, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.1947007 = idf(docFreq=1811, maxDocs=44218)
            0.0546875 = fieldNorm(doc=5273)
      0.029166508 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
        0.029166508 = score(doc=5273,freq=2.0), product of:
          0.107692726 = queryWeight, product of:
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.030753274 = queryNorm
          0.2708308 = fieldWeight in 5273, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.5018296 = idf(docFreq=3622, maxDocs=44218)
            0.0546875 = fieldNorm(doc=5273)
  0.2 = coord(2/10)

Abstract: In text categorization tasks, classification on some class hierarchies has better results than in cases without the hierarchy. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
Date: 22. 7.2006 16:24:52
Source: Journal of the American Society for Information Science and Technology. 57(2006) no.3, pp. 431-442

Schiminovich, S.: Automatic classification and retrieval of documents by means of a bibliographic pattern discovery algorithm (1971) 0.02

0.015242942 = product of:
  0.07621471 = sum of:
    0.014659365 = weight(_text_:information in 4846) [ClassicSimilarity], result of:
      0.014659365 = score(doc=4846,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.27153665 = fieldWeight in 4846, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.109375 = fieldNorm(doc=4846)
    0.06155534 = weight(_text_:retrieval in 4846) [ClassicSimilarity], result of:
      0.06155534 = score(doc=4846,freq=4.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.6617001 = fieldWeight in 4846, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.109375 = fieldNorm(doc=4846)
  0.2 = coord(2/10)

Source: Information storage and retrieval. 6(1971), pp. 417-435

Panyr, J.: Automatische Klassifikation und Information Retrieval : Anwendung und Entwicklung komplexer Verfahren in Information-Retrieval-Systemen und ihre Evaluierung (1986) 0.01

0.014905048 = product of:
  0.07452524 = sum of:
    0.021763513 = weight(_text_:information in 32) [ClassicSimilarity], result of:
      0.021763513 = score(doc=32,freq=6.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.40312737 = fieldWeight in 32, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.09375 = fieldNorm(doc=32)
    0.052761722 = weight(_text_:retrieval in 32) [ClassicSimilarity], result of:
      0.052761722 = score(doc=32,freq=4.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.5671716 = fieldWeight in 32, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.09375 = fieldNorm(doc=32)
  0.2 = coord(2/10)

Series: Sprache und Information; Bd.12

Jenkins, C.: Automatic classification of Web resources using Java and Dewey Decimal Classification (1998) 0.01

0.014013627 = product of:
  0.046712086 = sum of:
    0.010365736 = weight(_text_:information in 1673) [ClassicSimilarity], result of:
      0.010365736 = score(doc=1673,freq=4.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.1920054 = fieldWeight in 1673, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1673)
    0.0217631 = weight(_text_:retrieval in 1673) [ClassicSimilarity], result of:
      0.0217631 = score(doc=1673,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.23394634 = fieldWeight in 1673, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1673)
    0.014583254 = product of:
      0.029166508 = sum of:
        0.029166508 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
          0.029166508 = score(doc=1673,freq=2.0), product of:
            0.107692726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.030753274 = queryNorm
            0.2708308 = fieldWeight in 1673, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1673)
      0.5 = coord(1/2)
  0.3 = coord(3/10)

Abstract: The Wolverhampton Web Library (WWLib) is a WWW search engine that provides access to UK-based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to DDC. Discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib.
Date: 1. 8.1996 22:08:06
Theme: Klassifikationssysteme im Online-Retrieval
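
The explain trees on this page are nested: a node labelled "product of:" or "sum of:" should equal the product or sum of the nodes indented beneath it. The following is a minimal sketch, not part of the search system itself, that parses such a tree by indentation and checks that arithmetic. The function names `parse` and `inconsistent` are hypothetical, and the sample text is the nested `_text_:22` clause of doc 1673 copied from the entry above.

```python
import math
import re

# One explain line: leading indentation, a number, " = ", then a label.
LINE = re.compile(r"^(\s*)([0-9][0-9.eE+-]*) = (.*)$")

EXPLAIN = """\
0.014583254 = product of:
  0.029166508 = sum of:
    0.029166508 = weight(_text_:22 in 1673) [ClassicSimilarity], result of:
      0.029166508 = score(doc=1673,freq=2.0), product of:
        0.107692726 = queryWeight, product of:
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.030753274 = queryNorm
        0.2708308 = fieldWeight in 1673, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5018296 = idf(docFreq=3622, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1673)
  0.5 = coord(1/2)
"""

def parse(text):
    """Build nested nodes from indented explain lines (hypothetical helper)."""
    roots = []
    stack = [(-1, roots)]  # (indent, list that receives children)
    for raw in text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue  # skip anything that is not an explain line
        indent = len(match.group(1))
        node = {"value": float(match.group(2)),
                "label": match.group(3),
                "children": []}
        # Climb back up until the top of the stack is a strict ancestor.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((indent, node["children"]))
    return roots

def inconsistent(node, rel_tol=1e-5):
    """Return labels of nodes whose value disagrees with their children."""
    label, kids = node["label"], node["children"]
    expected = None
    if label.endswith("product of:"):
        expected = math.prod(k["value"] for k in kids)
    elif label.endswith("sum of:"):
        expected = sum(k["value"] for k in kids)
    elif label.endswith("result of:"):
        expected = kids[0]["value"]
    elif label.endswith("with freq of:"):
        expected = math.sqrt(kids[0]["value"])  # tf is sqrt(termFreq)
    bad = []
    if expected is not None and not math.isclose(
            node["value"], expected, rel_tol=rel_tol):
        bad.append(label)
    for kid in kids:
        bad.extend(inconsistent(kid, rel_tol))
    return bad

roots = parse(EXPLAIN)
problems = inconsistent(roots[0])
print("inconsistent nodes:", problems)
```

The relative tolerance is an assumption chosen to absorb the single-precision rounding visible in the printed values. An empty list means every parent agrees with its children.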

Search Engines and Beyond : Developing efficient knowledge management systems, April 19-20 1999, Boston, Mass (1999) 0.01
```
0.013706495 = product of:
  0.045688316 = sum of:
    0.0059232777 = weight(_text_:information in 2596) [ClassicSimilarity], result of:
      0.0059232777 = score(doc=2596,freq=4.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.10971737 = fieldWeight in 2596, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.03125 = fieldNorm(doc=2596)
    0.027807869 = weight(_text_:retrieval in 2596) [ClassicSimilarity], result of:
      0.027807869 = score(doc=2596,freq=10.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.29892567 = fieldWeight in 2596, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.03125 = fieldNorm(doc=2596)
    0.011957168 = product of:
      0.023914335 = sum of:
        0.023914335 = weight(_text_:evaluation in 2596) [ClassicSimilarity], result of:
          0.023914335 = score(doc=2596,freq=2.0), product of:
            0.12900078 = queryWeight, product of:
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.030753274 = queryNorm
            0.18538132 = fieldWeight in 2596, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.03125 = fieldNorm(doc=2596)
      0.5 = coord(1/2)
  0.3 = coord(3/10)
```
Content

Ramana Rao (Inxight, Palo Alto, CA): 7 ± 2 Insights on achieving Effective Information Access

Session One: Updates and a twelve month perspective
Danny Sullivan (Search Engine Watch, US / England): Portalization and other search trends
Carol Tenopir (University of Tennessee): Search realities faced by end users and professional searchers

Session Two: Today's search engines and beyond
Daniel Hoogterp (Retrieval Technologies, McLean, VA): Effective presentation and utilization of search techniques
Rick Kenny (Fulcrum Technologies, Ontario, Canada): Beyond document clustering: The knowledge impact statement
Gary Stock (Ingenius, Kalamazoo, MI): Automated change monitoring
Gary Culliss (Direct Hit, Wellesley Hills, MA): User popularity ranked search engines
Byron Dom (IBM, CA): Automatically finding the best pages on the World Wide Web (CLEVER)
Peter Tomassi (LookSmart, San Francisco, CA): Adding human intellect to search technology

Session Three: Panel discussion: Human v automated categorization and editing
Ev Brenner (New York, NY), Chairman
James Callan (University of Massachusetts, MA)
Marc Krellenstein (Northern Light Technology, Cambridge, MA)
Dan Miller (Ask Jeeves, Berkeley, CA)

Session Four: Updates and a twelve month perspective
Steve Arnold (AIT, Harrods Creek, KY): Review: The leading edge in search and retrieval software
Ellen Voorhees (NIST, Gaithersburg, MD): TREC update

Session Five: Search engines now and beyond
Intelligent Agents. John Snyder (Muscat, Cambridge, England): Practical issues behind intelligent agents
Text summarization. Therese Firmin (Dept of Defense, Ft George G. Meade, MD): The TIPSTER/SUMMAC evaluation of automatic text summarization systems
Cross language searching. Elizabeth Liddy (TextWise, Syracuse, NY): A conceptual interlingua approach to cross-language retrieval
Video search and retrieval. Armon Amir (IBM, Almaden, CA): CueVideo: Modular system for automatic indexing and browsing of video/audio
Speech recognition. Michael Witbrock (Lycos, Waltham, MA): Retrieval of spoken documents
Visualization. James A. Wise (Integral Visuals, Richland, WA): Information visualization in the new millennium: Emerging science or passing fashion?
Text mining. David Evans (Claritech, Pittsburgh, PA): Text mining - towards decision support

Rijsbergen, C.J. van: Automatic classification in information retrieval (1978) 0.01

0.013299557 = product of:
  0.06649779 = sum of:
    0.01675356 = weight(_text_:information in 2412) [ClassicSimilarity], result of:
      0.01675356 = score(doc=2412,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.3103276 = fieldWeight in 2412, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.125 = fieldNorm(doc=2412)
    0.04974423 = weight(_text_:retrieval in 2412) [ClassicSimilarity], result of:
      0.04974423 = score(doc=2412,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.5347345 = fieldWeight in 2412, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.125 = fieldNorm(doc=2412)
  0.2 = coord(2/10)
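
Each leaf in these trees follows the classic TF-IDF formulas of Lucene's ClassicSimilarity: tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), queryWeight is idf × queryNorm, fieldWeight is tf × idf × fieldNorm, and the summed term scores are scaled by the coord factor. As a rough sketch, the entry above (doc 2412) can be recomputed from its inputs. The helper names `idf` and `term_score` are hypothetical, while queryNorm, fieldNorm, and the frequencies are taken from the tree as given.

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as printed in the trees."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one term clause: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                       # tf(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm       # queryWeight
    field_weight = tf * term_idf * field_norm  # fieldWeight
    return query_weight * field_weight

# Values read off the explain tree for doc 2412.
MAX_DOCS = 44218
QUERY_NORM = 0.030753274
FIELD_NORM = 0.125

information = term_score(2.0, 20772, MAX_DOCS, QUERY_NORM, FIELD_NORM)
retrieval = term_score(2.0, 5836, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# coord(2/10): two of the ten query clauses matched this document.
total = (information + retrieval) * (2 / 10)

print(f"information: {information:.8f}")  # tree reports 0.01675356
print(f"retrieval:   {retrieval:.8f}")    # tree reports 0.04974423
print(f"total:       {total:.9f}")        # tree reports 0.013299557
```

The recomputed values use double precision, so they may differ from the single-precision figures in the tree in the last printed digit.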

Chung, Y.M.; Lee, J.Y.: A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.01

0.012288752 = product of:
  0.040962506 = sum of:
    0.010470974 = weight(_text_:information in 5769) [ClassicSimilarity], result of:
      0.010470974 = score(doc=5769,freq=8.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.19395474 = fieldWeight in 5769, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5769)
    0.015545071 = weight(_text_:retrieval in 5769) [ClassicSimilarity], result of:
      0.015545071 = score(doc=5769,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.16710453 = fieldWeight in 5769, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5769)
    0.01494646 = product of:
      0.02989292 = sum of:
        0.02989292 = weight(_text_:evaluation in 5769) [ClassicSimilarity], result of:
          0.02989292 = score(doc=5769,freq=2.0), product of:
            0.12900078 = queryWeight, product of:
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.030753274 = queryNorm
            0.23172665 = fieldWeight in 5769, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.5 = coord(1/2)
  0.3 = coord(3/10)

Abstract: Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as the χ² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
Source: Journal of the American Society for Information Science and Technology. 52(2001) no.4, S.283-296

Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.01
```
0.011944045 = product of:
  0.03981348 = sum of:
    0.013851797 = weight(_text_:information in 1107) [ClassicSimilarity], result of:
      0.013851797 = score(doc=1107,freq=14.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.256578 = fieldWeight in 1107, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.015545071 = weight(_text_:retrieval in 1107) [ClassicSimilarity], result of:
      0.015545071 = score(doc=1107,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.16710453 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.01041661 = product of:
      0.02083322 = sum of:
        0.02083322 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.02083322 = score(doc=1107,freq=2.0), product of:
            0.107692726 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.030753274 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(1/2)
  0.3 = coord(3/10)
```
Abstract

Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.

Date

28.10.2013 19:22:57

Source

Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2265-2277

Golub, K.: Automated subject classification of textual web documents (2006) 0.01
```
0.011867899 = product of:
  0.039559662 = sum of:
    0.0090681305 = weight(_text_:information in 5600) [ClassicSimilarity], result of:
      0.0090681305 = score(doc=5600,freq=6.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.16796975 = fieldWeight in 5600, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5600)
    0.015545071 = weight(_text_:retrieval in 5600) [ClassicSimilarity], result of:
      0.015545071 = score(doc=5600,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.16710453 = fieldWeight in 5600, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5600)
    0.01494646 = product of:
      0.02989292 = sum of:
        0.02989292 = weight(_text_:evaluation in 5600) [ClassicSimilarity], result of:
          0.02989292 = score(doc=5600,freq=2.0), product of:
            0.12900078 = queryWeight, product of:
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.030753274 = queryNorm
            0.23172665 = fieldWeight in 5600, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5600)
      0.5 = coord(1/2)
  0.3 = coord(3/10)
```
Abstract

Purpose - To provide an integrated perspective on similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such. Design/methodology/approach - A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings - Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics are common to all the approaches; major differences are in applied algorithms, employment or not of the vector space model and of controlled vocabularies. Problems of automated classification are recognized. Research limitations/implications - The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications - As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities. Originality/value - To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.

Mu, T.; Goulermas, J.Y.; Korkontzelos, I.; Ananiadou, S.: Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities (2016) 0.01
```
0.011754737 = product of:
  0.058773685 = sum of:
    0.0090681305 = weight(_text_:information in 2496) [ClassicSimilarity], result of:
      0.0090681305 = score(doc=2496,freq=6.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.16796975 = fieldWeight in 2496, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2496)
    0.049705554 = weight(_text_:ranking in 2496) [ClassicSimilarity], result of:
      0.049705554 = score(doc=2496,freq=2.0), product of:
        0.16634533 = queryWeight, product of:
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.030753274 = queryNorm
        0.29880944 = fieldWeight in 2496, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.4090285 = idf(docFreq=537, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2496)
  0.2 = coord(2/10)
```
Abstract

Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases with a ranking scheme that uses multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.

Source

Journal of the Association for Information Science and Technology. 67(2016) no.1, S.106-133

Wu, M.; Fuller, M.; Wilkinson, R.: Using clustering and classification approaches in interactive retrieval (2001) 0.01

0.011637113 = product of:
  0.058185562 = sum of:
    0.014659365 = weight(_text_:information in 2666) [ClassicSimilarity], result of:
      0.014659365 = score(doc=2666,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.27153665 = fieldWeight in 2666, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.109375 = fieldNorm(doc=2666)
    0.0435262 = weight(_text_:retrieval in 2666) [ClassicSimilarity], result of:
      0.0435262 = score(doc=2666,freq=2.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.46789268 = fieldWeight in 2666, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.109375 = fieldNorm(doc=2666)
  0.2 = coord(2/10)

Source: Information processing and management. 37(2001) no.3, S.459-484

Panyr, J.: Vektorraum-Modell und Clusteranalyse in Information-Retrieval-Systemen (1987) 0.01

0.011517756 = product of:
  0.05758878 = sum of:
    0.014509009 = weight(_text_:information in 2322) [ClassicSimilarity], result of:
      0.014509009 = score(doc=2322,freq=6.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.2687516 = fieldWeight in 2322, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0625 = fieldNorm(doc=2322)
    0.043079767 = weight(_text_:retrieval in 2322) [ClassicSimilarity], result of:
      0.043079767 = score(doc=2322,freq=6.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.46309367 = fieldWeight in 2322, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0625 = fieldNorm(doc=2322)
  0.2 = coord(2/10)

Abstract: Starting from theoretical approaches to indexing, the classical vector space model for automatic indexing (together with the discrimination value model) is explained. Clustering in information retrieval systems is treated as a natural logical consequence of this model and is covered in all its forms (i.e., as document classification, term classification, or combined document and term classification). The search strategies for pre-classified document collections (cluster search) are then described in detail. Finally, the sensible application of cluster analysis in information retrieval systems is briefly discussed.

Borko, H.: Research in computer based classification systems (1985) 0.01
```
0.01115259 = product of:
  0.037175298 = sum of:
    0.0036648412 = weight(_text_:information in 3647) [ClassicSimilarity], result of:
      0.0036648412 = score(doc=3647,freq=2.0), product of:
        0.05398669 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.030753274 = queryNorm
        0.06788416 = fieldWeight in 3647, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.02734375 = fieldNorm(doc=3647)
    0.015388835 = weight(_text_:retrieval in 3647) [ClassicSimilarity], result of:
      0.015388835 = score(doc=3647,freq=4.0), product of:
        0.093026035 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.030753274 = queryNorm
        0.16542503 = fieldWeight in 3647, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.02734375 = fieldNorm(doc=3647)
    0.01812162 = product of:
      0.03624324 = sum of:
        0.03624324 = weight(_text_:evaluation in 3647) [ClassicSimilarity], result of:
          0.03624324 = score(doc=3647,freq=6.0), product of:
            0.12900078 = queryWeight, product of:
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.030753274 = queryNorm
            0.28095365 = fieldWeight in 3647, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.1947007 = idf(docFreq=1811, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3647)
      0.5 = coord(1/2)
  0.3 = coord(3/10)
```
Abstract

The selection in this reader by R. M. Needham and K. Sparck Jones reports an early approach to automatic classification that was taken in England. The following selection reviews various approaches that were being pursued in the United States at about the same time. It then discusses a particular approach initiated in the early 1960s by Harold Borko, at that time Head of the Language Processing and Retrieval Research Staff at the System Development Corporation, Santa Monica, California and, since 1966, a member of the faculty at the Graduate School of Library and Information Science, University of California, Los Angeles. As was described earlier, there are two steps in automatic classification, the first being to identify pairs of terms that are similar by virtue of co-occurring as index terms in the same documents, and the second being to form equivalence classes of intersubstitutable terms. To compute similarities, Borko and his associates used a standard correlation formula; to derive classification categories, where Needham and Sparck Jones used clumping, the Borko team used the statistical technique of factor analysis. The fact that documents can be classified automatically, and in any number of ways, is worthy of passing notice. Worthy of serious attention would be a demonstration that a computer-based classification system was effective in the organization and retrieval of documents. One reason for the inclusion of the following selection in the reader is that it addresses the question of evaluation. To evaluate the effectiveness of their automatically derived classification, Borko and his team asked three questions. The first was: Is the classification reliable? In other words, could the categories derived from one sample of texts be used to classify other texts? Reliability was assessed by a case-study comparison of the classes derived from three different samples of abstracts.
The not-so-surprising conclusion reached was that automatically derived classes were reliable only to the extent that the sample from which they were derived was representative of the total document collection. The second evaluation question asked whether the classification was reasonable, in the sense of adequately describing the content of the document collection. The answer was sought by comparing the automatically derived categories with categories in a related classification system that was manually constructed. Here the conclusion was that the automatic method yielded categories that fairly accurately reflected the major area of interest in the sample collection of texts; however, since there were only eleven such categories and they were quite broad, they could not be regarded as suitable for use in a university or any large general library. The third evaluation question asked whether automatic classification was accurate, in the sense of producing results similar to those obtainable by human classifiers. When using human classification as a criterion, automatic classification was found to be 50 percent accurate.
