Search (102 results, page 1 of 6)

Liu, R.-L.: Context-based term frequency assessment for text classification (2010) 0.47

0.4729282 = sum of:
  0.0122329 = product of:
    0.0489316 = sum of:
      0.0489316 = weight(_text_:based in 3331) [ClassicSimilarity], result of:
        0.0489316 = score(doc=3331,freq=6.0), product of:
          0.14144066 = queryWeight, product of:
            3.0129938 = idf(docFreq=5906, maxDocs=44218)
            0.04694356 = queryNorm
          0.34595144 = fieldWeight in 3331, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            3.0129938 = idf(docFreq=5906, maxDocs=44218)
            0.046875 = fieldNorm(doc=3331)
    0.25 = coord(1/4)
  0.1916339 = weight(_text_:term in 3331) [ClassicSimilarity], result of:
    0.1916339 = score(doc=3331,freq=16.0), product of:
      0.21904005 = queryWeight, product of:
        4.66603 = idf(docFreq=1130, maxDocs=44218)
        0.04694356 = queryNorm
      0.8748806 = fieldWeight in 3331, product of:
        4.0 = tf(freq=16.0), with freq of:
          16.0 = termFreq=16.0
        4.66603 = idf(docFreq=1130, maxDocs=44218)
        0.046875 = fieldNorm(doc=3331)
  0.18691254 = weight(_text_:frequency in 3331) [ClassicSimilarity], result of:
    0.18691254 = score(doc=3331,freq=6.0), product of:
      0.27643865 = queryWeight, product of:
        5.888745 = idf(docFreq=332, maxDocs=44218)
        0.04694356 = queryNorm
      0.6761447 = fieldWeight in 3331, product of:
        2.4494898 = tf(freq=6.0), with freq of:
          6.0 = termFreq=6.0
        5.888745 = idf(docFreq=332, maxDocs=44218)
        0.046875 = fieldNorm(doc=3331)
  0.08214887 = product of:
    0.16429774 = sum of:
      0.16429774 = weight(_text_:assessment in 3331) [ClassicSimilarity], result of:
        0.16429774 = score(doc=3331,freq=6.0), product of:
          0.25917634 = queryWeight, product of:
            5.52102 = idf(docFreq=480, maxDocs=44218)
            0.04694356 = queryNorm
          0.63392264 = fieldWeight in 3331, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            5.52102 = idf(docFreq=480, maxDocs=44218)
            0.046875 = fieldNorm(doc=3331)
    0.5 = coord(1/2)

Abstract: Automatic text classification (TC) is essential for the management of information. To properly classify a document d, it is essential to identify the semantics of each term t in d, while the semantics heavily depend on context (neighboring terms) of t in d. Therefore, we present a technique CTFA (Context-based Term Frequency Assessment) that improves text classifiers by considering term contexts in test documents. The results of the term context recognition are used to assess term frequencies of terms, and hence CTFA may easily work with various kinds of text classifiers that base their TC decisions on term frequencies, without needing to modify the classifiers. Moreover, CTFA is efficient, and neither huge memory nor domain-specific knowledge is required. Empirical results show that CTFA successfully enhances performance of several kinds of text classifiers on different experimental data.
Object: Context-based Term Frequency Assessment

Chung, Y.M.; Lee, J.Y.: ¬A corpus-based approach to comparative evaluation of statistical term association measures (2001) 0.23

0.23399408 = product of:
  0.3119921 = sum of:
    0.005885557 = product of:
      0.023542227 = sum of:
        0.023542227 = weight(_text_:based in 5769) [ClassicSimilarity], result of:
          0.023542227 = score(doc=5769,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.16644597 = fieldWeight in 5769, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5769)
      0.25 = coord(1/4)
    0.12624991 = weight(_text_:term in 5769) [ClassicSimilarity], result of:
      0.12624991 = score(doc=5769,freq=10.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.5763782 = fieldWeight in 5769, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5769)
    0.17985666 = weight(_text_:frequency in 5769) [ClassicSimilarity], result of:
      0.17985666 = score(doc=5769,freq=8.0), product of:
        0.27643865 = queryWeight, product of:
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.04694356 = queryNorm
        0.6506205 = fieldWeight in 5769, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5769)
  0.75 = coord(3/4)

Abstract: Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of a term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as X**2 statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the X**2 statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms

Ko, Y.: ¬A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.17
```
0.16593526 = product of:
  0.33187053 = sum of:
    0.17925708 = weight(_text_:term in 2339) [ClassicSimilarity], result of:
      0.17925708 = score(doc=2339,freq=14.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.8183758 = fieldWeight in 2339, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
    0.15261345 = weight(_text_:frequency in 2339) [ClassicSimilarity], result of:
      0.15261345 = score(doc=2339,freq=4.0), product of:
        0.27643865 = queryWeight, product of:
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.04694356 = queryNorm
        0.55206984 = fieldWeight in 2339, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
  0.5 = coord(2/4)
```
Abstract

Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.

AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.11

0.11420593 = product of:
  0.15227456 = sum of:
    0.005885557 = product of:
      0.023542227 = sum of:
        0.023542227 = weight(_text_:based in 2836) [ClassicSimilarity], result of:
          0.023542227 = score(doc=2836,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.16644597 = fieldWeight in 2836, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2836)
      0.25 = coord(1/4)
    0.056460675 = weight(_text_:term in 2836) [ClassicSimilarity], result of:
      0.056460675 = score(doc=2836,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.25776416 = fieldWeight in 2836, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2836)
    0.08992833 = weight(_text_:frequency in 2836) [ClassicSimilarity], result of:
      0.08992833 = score(doc=2836,freq=2.0), product of:
        0.27643865 = queryWeight, product of:
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.04694356 = queryNorm
        0.32531026 = fieldWeight in 2836, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2836)
  0.75 = coord(3/4)

Abstract: Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.

Altinel, B.; Ganiz, M.C.: Semantic text classification : a survey of past and recent advances (2018) 0.11
```
0.10798191 = product of:
  0.14397588 = sum of:
    0.008155267 = product of:
      0.032621067 = sum of:
        0.032621067 = weight(_text_:based in 5051) [ClassicSimilarity], result of:
          0.032621067 = score(doc=5051,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.2306343 = fieldWeight in 5051, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.03125 = fieldNorm(doc=5051)
      0.25 = coord(1/4)
    0.06387796 = weight(_text_:term in 5051) [ClassicSimilarity], result of:
      0.06387796 = score(doc=5051,freq=4.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.29162687 = fieldWeight in 5051, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.03125 = fieldNorm(doc=5051)
    0.071942665 = weight(_text_:frequency in 5051) [ClassicSimilarity], result of:
      0.071942665 = score(doc=5051,freq=2.0), product of:
        0.27643865 = queryWeight, product of:
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.04694356 = queryNorm
        0.2602482 = fieldWeight in 5051, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.03125 = fieldNorm(doc=5051)
  0.75 = coord(3/4)
```
Abstract

Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. Generally speaking, it is one of the most important methods to organize and make use of the gigantic amounts of information that exist in unstructured textual format. Text classification is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words where the words in other words terms are cut from their finer context i.e. their location in a sentence or in a document. Only the broader context of document is used with some type of term frequency information in the vector space. Consequently, semantics of words that can be inferred from the finer context of its location in a sentence and its relations with neighboring words are usually ignored. However, meaning of words, semantic connections between words, documents and even classes are obviously important since methods that capture semantics generally reach better classification performances. Several surveys have been published to analyze diverse approaches for the traditional text classification methods. Most of these surveys cover application of different semantic term relatedness methods in text classification up to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over the traditional text classification. In order to fill this gap, we undertake a comprehensive discussion of semantic text classification vs. traditional text classification. This survey explores the past and recent advancements in semantic text classification and attempts to organize existing approaches under five fundamental categories; domain knowledge-based approaches, corpus-based approaches, deep learning based approaches, word/character sequence enhanced approaches and linguistic enriched approaches. Furthermore, this survey highlights the advantages of semantic text classification algorithms over the traditional text classification algorithms.

Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.11

0.10762094 = product of:
  0.14349459 = sum of:
    0.0070626684 = product of:
      0.028250674 = sum of:
        0.028250674 = weight(_text_:based in 690) [ClassicSimilarity], result of:
          0.028250674 = score(doc=690,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.19973516 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.25 = coord(1/4)
    0.117351316 = weight(_text_:term in 690) [ClassicSimilarity], result of:
      0.117351316 = score(doc=690,freq=6.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.5357528 = fieldWeight in 690, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.019080611 = product of:
      0.038161222 = sum of:
        0.038161222 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
          0.038161222 = score(doc=690,freq=2.0), product of:
            0.16438834 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04694356 = queryNorm
            0.23214069 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.5 = coord(1/2)
  0.75 = coord(3/4)

Abstract: We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between latent semantic indexing (LSI) term subspace and LSI document subspace. LSISSM does feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
Date: 23. 3.2013 13:22:36

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.08

0.07544754 = product of:
  0.15089507 = sum of:
    0.13181446 = product of:
      0.26362893 = sum of:
        0.22367644 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.22367644 = score(doc=562,freq=2.0), product of:
            0.39798802 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04694356 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.039952483 = weight(_text_:based in 562) [ClassicSimilarity], result of:
          0.039952483 = score(doc=562,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 562, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(2/4)
    0.019080611 = product of:
      0.038161222 = sum of:
        0.038161222 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.038161222 = score(doc=562,freq=2.0), product of:
            0.16438834 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04694356 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.5 = coord(2/4)

Abstract: Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
Content: Vgl.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32

Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.06

0.060513463 = product of:
  0.08068462 = sum of:
    0.008323434 = product of:
      0.033293735 = sum of:
        0.033293735 = weight(_text_:based in 1107) [ClassicSimilarity], result of:
          0.033293735 = score(doc=1107,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.23539014 = fieldWeight in 1107, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.25 = coord(1/4)
    0.056460675 = weight(_text_:term in 1107) [ClassicSimilarity], result of:
      0.056460675 = score(doc=1107,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.25776416 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.015900511 = product of:
      0.031801023 = sum of:
        0.031801023 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.031801023 = score(doc=1107,freq=2.0), product of:
            0.16438834 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04694356 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(1/2)
  0.75 = coord(3/4)

Abstract: Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
Date: 28.10.2013 19:22:57

Chung, Y.-M.; Noh, Y.-H.: Developing a specialized directory system by automatically classifying Web documents (2003) 0.05
```
0.052902535 = product of:
  0.10580507 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 1566) [ClassicSimilarity], result of:
          0.039952483 = score(doc=1566,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 1566, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1566)
      0.25 = coord(1/4)
    0.09581695 = weight(_text_:term in 1566) [ClassicSimilarity], result of:
      0.09581695 = score(doc=1566,freq=4.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.4374403 = fieldWeight in 1566, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.046875 = fieldNorm(doc=1566)
  0.5 = coord(2/4)
```
Abstract

This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier

Golub, K.: Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations (2006) 0.05

0.052902535 = product of:
  0.10580507 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 5897) [ClassicSimilarity], result of:
          0.039952483 = score(doc=5897,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 5897, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=5897)
      0.25 = coord(1/4)
    0.09581695 = weight(_text_:term in 5897) [ClassicSimilarity], result of:
      0.09581695 = score(doc=5897,freq=4.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.4374403 = fieldWeight in 5897, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.046875 = fieldNorm(doc=5897)
  0.5 = coord(2/4)

Abstract: The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.

Wang, H.; Hong, M.: Supervised Hebb rule based feature selection for text classification (2019) 0.05
```
0.045020774 = product of:
  0.09004155 = sum of:
    0.010194084 = product of:
      0.040776335 = sum of:
        0.040776335 = weight(_text_:based in 5036) [ClassicSimilarity], result of:
          0.040776335 = score(doc=5036,freq=6.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28829288 = fieldWeight in 5036, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5036)
      0.25 = coord(1/4)
    0.07984746 = weight(_text_:term in 5036) [ClassicSimilarity], result of:
      0.07984746 = score(doc=5036,freq=4.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.3645336 = fieldWeight in 5036, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5036)
  0.5 = coord(2/4)
```
Abstract

Text documents usually contain high dimensional non-discriminative (irrelevant and noisy) terms which lead to steep computational costs and poor learning performance of text classification. One of the effective solutions for this problem is feature selection which aims to identify discriminative terms from text data. This paper proposes a method termed "Hebb rule based feature selection (HRFS)". HRFS is based on supervised Hebb rule and assumes that terms and classes are neurons and select terms under the assumption that a term is discriminative if it keeps "exciting" the corresponding classes. This assumption can be explained as "a term is highly correlated with a class if it is able to keep "exciting" the class according to the original Hebb postulate. Six benchmarking datasets are used to compare HRFS with other seven feature selection methods. Experimental results indicate that HRFS is effective to achieve better performance than the compared methods. HRFS can identify discriminative terms in the view of synapse between neurons. Moreover, HRFS is also efficient because it can be described in the view of matrix operation to decrease complexity of feature selection.

Miyamoto, S.: Information clustering based an fuzzy multisets (2003) 0.04

0.043642364 = product of:
  0.08728473 = sum of:
    0.00823978 = product of:
      0.03295912 = sum of:
        0.03295912 = weight(_text_:based in 1071) [ClassicSimilarity], result of:
          0.03295912 = score(doc=1071,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.23302436 = fieldWeight in 1071, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1071)
      0.25 = coord(1/4)
    0.079044946 = weight(_text_:term in 1071) [ClassicSimilarity], result of:
      0.079044946 = score(doc=1071,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.36086982 = fieldWeight in 1071, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1071)
  0.5 = coord(2/4)

Abstract: A fuzzy multiset model for information clustering is proposed with application to information retrieval on the World Wide Web. Noting that a search engine retrieves multiple occurrences of the same subjects with possibly different degrees of relevance, we observe that fuzzy multisets provide an appropriate model of information retrieval on the WWW. Information clustering which means both term clustering and document clustering is considered. Three methods of the hard c-means, fuzzy c-means, and an agglomerative method using cluster centers are proposed. Two distances between fuzzy multisets and algorithms for calculating cluster centers are defined. Theoretical properties concerning the clustering algorithms are studied. Illustrative examples are given to show how the algorithms work.

Golub, K.; Hamon, T.; Ardö, A.: Automated classification of textual documents based on a controlled vocabulary in engineering (2007) 0.04
```
0.038870465 = product of:
  0.07774093 = sum of:
    0.009988121 = product of:
      0.039952483 = sum of:
        0.039952483 = weight(_text_:based in 1461) [ClassicSimilarity], result of:
          0.039952483 = score(doc=1461,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.28246817 = fieldWeight in 1461, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.046875 = fieldNorm(doc=1461)
      0.25 = coord(1/4)
    0.06775281 = weight(_text_:term in 1461) [ClassicSimilarity], result of:
      0.06775281 = score(doc=1461,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.309317 = fieldWeight in 1461, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.046875 = fieldNorm(doc=1461)
  0.5 = coord(2/4)
```
Abstract

Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents - instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and en- richment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
Kwon, O.W.; Lee, J.H.: Text categorization based on k-nearest neighbor approach for web site classification (2003) 0.03
```
0.032392055 = product of:
  0.06478411 = sum of:
    0.008323434 = product of:
      0.033293735 = sum of:
        0.033293735 = weight(_text_:based in 1070) [ClassicSimilarity], result of:
          0.033293735 = score(doc=1070,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.23539014 = fieldWeight in 1070, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1070)
      0.25 = coord(1/4)
    0.056460675 = weight(_text_:term in 1070) [ClassicSimilarity], result of:
      0.056460675 = score(doc=1070,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.25776416 = fieldWeight in 1070, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1070)
  0.5 = coord(2/4)
```
Abstract

Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. For Web site classification, this paper proposes the use of Web pages linked with the home page in a different manner from the sole use of home pages in previous research. To implement our proposed method, we derive a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach. It consists of three phases: Web page selection (connectivity analysis), Web page classification, and Web site classification. Given a Web site, the Web page selection chooses several representative Web pages using connectivity analysis. The k-NN classifier next classifies each of the selected Web pages. Finally, the classified Web pages are extended to a classification of the entire Web site. To improve performance, we supplement the k-NN approach with a feature selection method and a term weighting scheme using markup tags, and also reform its document-document similarity measure. In our experiments on a Korean commercial Web directory, the proposed system, using both a home page and its linked pages, improved the performance of micro-averaging breakeven point by 30.02%, compared with an ordinary classification which uses a home page only.
Golub, K.; Lykke, M.: Automated classification of web pages in hierarchical browsing (2009) 0.03
```
0.032392055 = product of:
  0.06478411 = sum of:
    0.008323434 = product of:
      0.033293735 = sum of:
        0.033293735 = weight(_text_:based in 3614) [ClassicSimilarity], result of:
          0.033293735 = score(doc=3614,freq=4.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.23539014 = fieldWeight in 3614, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3614)
      0.25 = coord(1/4)
    0.056460675 = weight(_text_:term in 3614) [ClassicSimilarity], result of:
      0.056460675 = score(doc=3614,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.25776416 = fieldWeight in 3614, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3614)
  0.5 = coord(2/4)
```
Abstract

Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification systems for browsing. The classification algorithm was evaluated by the users who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing showed to be correlated and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesauri terms); when searching for class captions, returning the hierarchical tree expanded around the class in which caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
Yang, Y.; Liu, X.: ¬A re-examination of text categorization methods (1999) 0.03
```
0.031474918 = product of:
  0.12589967 = sum of:
    0.12589967 = weight(_text_:frequency in 3386) [ClassicSimilarity], result of:
      0.12589967 = score(doc=3386,freq=2.0), product of:
        0.27643865 = queryWeight, product of:
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.04694356 = queryNorm
        0.45543438 = fieldWeight in 3386, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.888745 = idf(docFreq=332, maxDocs=44218)
          0.0546875 = fieldNorm(doc=3386)
  0.25 = coord(1/4)
```
Abstract

This paper reports a controlled study with statistical significance tests an five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Leastsquares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus an the robustness of these methods in dealing with a skewed category distribution, and their performance as function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category are small (less than ten, and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
Ru, C.; Tang, J.; Li, S.; Xie, S.; Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction (2018) 0.03
```
0.031173116 = product of:
  0.06234623 = sum of:
    0.005885557 = product of:
      0.023542227 = sum of:
        0.023542227 = weight(_text_:based in 5055) [ClassicSimilarity], result of:
          0.023542227 = score(doc=5055,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.16644597 = fieldWeight in 5055, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5055)
      0.25 = coord(1/4)
    0.056460675 = weight(_text_:term in 5055) [ClassicSimilarity], result of:
      0.056460675 = score(doc=5055,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.25776416 = fieldWeight in 5055, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
  0.5 = coord(2/4)
```
Abstract

Distant supervision (DS) has the advantage of automatically generating large amounts of labelled training data and has been widely used for relation extraction. However, there are usually many wrong labels in the automatically labelled data in distant supervision (Riedel, Yao, & McCallum, 2010). This paper presents a novel method to reduce the wrong labels. The proposed method uses the semantic Jaccard with word embedding to measure the semantic similarity between the relation phrase in the knowledge base and the dependency phrases between two entities in a sentence to filter the wrong labels. In the process of reducing wrong labels, the semantic Jaccard algorithm selects a core dependency phrase to represent the candidate relation in a sentence, which can capture features for relation classification and avoid the negative impact from irrelevant term sequences that previous neural network models of relation extraction often suffer. In the process of relation classification, the core dependency phrases are also used as the input of a convolutional neural network (CNN) for relation classification. The experimental results show that compared with the methods using original DS data, the methods using filtered DS data performed much better in relation extraction. It indicates that the semantic similarity based method is effective in reducing wrong labels. The relation extraction performance of the CNN model using the core dependency phrases as input is the best of all, which indicates that using the core dependency phrases as input of CNN is enough to capture the features for relation classification and could avoid negative impact from irrelevant terms.
Panyr, J.: Vektorraum-Modell und Clusteranalyse in Information-Retrieval-Systemen (1987) 0.02
```
0.02258427 = product of:
  0.09033708 = sum of:
    0.09033708 = weight(_text_:term in 2322) [ClassicSimilarity], result of:
      0.09033708 = score(doc=2322,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.41242266 = fieldWeight in 2322, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0625 = fieldNorm(doc=2322)
  0.25 = coord(1/4)
```
Abstract

Ausgehend von theoretischen Indexierungsansätzen wird das klassische Vektorraum-Modell für automatische Indexierung (mit dem Trennschärfen-Modell) erläutert. Das Clustering in Information-Retrieval-Systemem wird als eine natürliche logische Folge aus diesem Modell aufgefaßt und in allen seinen Ausprägungen (d.h. als Dokumenten-, Term- oder Dokumenten- und Termklassifikation) behandelt. Anschließend werden die Suchstrategien in vorklassifizierten Dokumentenbeständen (Clustersuche) detailliert beschrieben. Zum Schluß wird noch die sinnvolle Anwendung der Clusteranalyse in Information-Retrieval-Systemen kurz diskutiert

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.02

0.021786068 = product of:
  0.043572135 = sum of:
    0.011771114 = product of:
      0.047084454 = sum of:
        0.047084454 = weight(_text_:based in 2748) [ClassicSimilarity], result of:
          0.047084454 = score(doc=2748,freq=2.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.33289194 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.25 = coord(1/4)
    0.031801023 = product of:
      0.063602045 = sum of:
        0.063602045 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.063602045 = score(doc=2748,freq=2.0), product of:
            0.16438834 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04694356 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.5 = coord(2/4)

Date: 1. 2.2016 18:25:22
Source: Semantic keyword-based search on structured data sources: First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers. Eds.: J. Cardoso et al

Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.02
```
0.020469537 = product of:
  0.040939074 = sum of:
    0.0070626684 = product of:
      0.028250674 = sum of:
        0.028250674 = weight(_text_:based in 1253) [ClassicSimilarity], result of:
          0.028250674 = score(doc=1253,freq=8.0), product of:
            0.14144066 = queryWeight, product of:
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.04694356 = queryNorm
            0.19973516 = fieldWeight in 1253, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.0129938 = idf(docFreq=5906, maxDocs=44218)
              0.0234375 = fieldNorm(doc=1253)
      0.25 = coord(1/4)
    0.033876404 = weight(_text_:term in 1253) [ClassicSimilarity], result of:
      0.033876404 = score(doc=1253,freq=2.0), product of:
        0.21904005 = queryWeight, product of:
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.04694356 = queryNorm
        0.1546585 = fieldWeight in 1253, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.66603 = idf(docFreq=1130, maxDocs=44218)
          0.0234375 = fieldNorm(doc=1253)
  0.5 = coord(2/4)
```
Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). Our work with the Alexandria Digital Library (ADL) Project focuses on geo-referenced information, whether text, maps, aerial photographs, or satellite images. As a result, we have emphasized techniques which work with both text and non-text, such as combined textual and graphical queries, multi-dimensional indexing, and IR methods which are not solely dependent on words or phrases. Part of this work involves locating relevant online sources of information. In particular, we have designed and are currently testing aspects of an architecture, Pharos, which we believe will scale up to 1.000.000 heterogeneous sources. Pharos accommodates heterogeneity in content and format, both among multiple sources as well as within a single source. That is, we consider sources to include Web sites, FTP archives, newsgroups, and full digital libraries; all of these systems can include a wide variety of content and multimedia data formats. Pharos is based on the use of hierarchical classification schemes. These include not only well-known 'subject' (or 'concept') based schemes such as the Dewey Decimal System and the LCC, but also, for example, geographic classifications, which might be constructed as layers of smaller and smaller hierarchical longitude/latitude boxes. Pharos is designed to work with sophisticated queries which utilize subjects, geographical locations, temporal specifications, and other types of information domains. The Pharos architecture requires that hierarchically structured collection metadata be extracted so that it can be partitioned in such a way as to greatly enhance scalability. Automated classification is important to Pharos because it allows information sources to extract the requisite collection metadata automatically that must be distributed.
We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. (The prototype can be tested below, and more details may be found at http://pharos.alexandria.ucsb.edu/). The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. We use the training set to build a rich set of classification terminology, and associate these terms with the relevant categories in the LCC. This association between terms and classification categories allows us to relate users' queries to nodes in the LCC so that users can select appropriate query categories. Newsgroups are similarly associated with classification categories. Pharos then matches the categories selected by users to relevant newsgroups. In principle, this approach allows users to exclude newsgroups that might have been selected based on an unintended meaning of a query term, and to include newsgroups with relevant content even though the exact query terms may not have been used. This work is extensible to other types of classification, including geographical, temporal, and image feature. Before discussing the methodology of the collection summarization and selection, we first present an online demonstration below. The demonstration is not intended to be a complete end-user interface. Rather, it is intended merely to offer a view of the process to suggest the "look and feel" of the prototype. The demo works as follows. First supply it with a few keywords of interest. The system will then use those terms to try to return to you the most relevant subject categories within the LCC. Assuming that the system recognizes any of your terms (it has over 400,000 terms indexed), it will give you a list of 15 LCC categories sorted by relevancy ranking. From there, you have two choices. The first choice, by clicking on the "News" links, is to get a list of newsgroups which the system has identified as relevant to the LCC category you select. The other choice, by clicking on the LCC ID links, is to enter the LCC hierarchy starting at the category of your choice and navigate the tree until you locate the best category for your query. From there, again, you can get a list of newsgroups by clicking on the "News" links. After having shown this demonstration to many people, we would like to suggest that you first give it easier examples before trying to break it. For example, "prostate cancer" (discussed below), "remote sensing", "investment banking", and "gershwin" all work reasonably well.

Search (102 results, page 1 of 6)

Authors

Years

Languages

Types

Themes