Search (39 results, page 1 of 2)

  • × theme_ss:"Automatisches Klassifizieren"
  • × type_ss:"a"
  • × year_i:[2010 TO 2020}
  1. Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.04
    0.03849499 = product of:
      0.07698998 = sum of:
        0.02143378 = weight(_text_:information in 1107) [ClassicSimilarity], result of:
          0.02143378 = score(doc=1107,freq=14.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.256578 = fieldWeight in 1107, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
        0.055556197 = sum of:
          0.02331961 = weight(_text_:technology in 1107) [ClassicSimilarity], result of:
            0.02331961 = score(doc=1107,freq=2.0), product of:
              0.1417311 = queryWeight, product of:
                2.978387 = idf(docFreq=6114, maxDocs=44218)
                0.047586527 = queryNorm
              0.16453418 = fieldWeight in 1107, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.978387 = idf(docFreq=6114, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1107)
          0.032236587 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
            0.032236587 = score(doc=1107,freq=2.0), product of:
              0.16663991 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047586527 = queryNorm
              0.19345059 = fieldWeight in 1107, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(2/4)
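    The score breakdown above is Lucene's classic TF-IDF explanation: each matching term contributes queryWeight x fieldWeight, where queryWeight = idf x queryNorm and fieldWeight = sqrt(termFreq) x idf x fieldNorm, and the clause sum is scaled by coord (the fraction of query clauses that matched). A minimal Python check of the "information" clause and of the entry total, using only the numbers printed above:
      import math

      idf_information = 1.7554779     # idf(docFreq=20772, maxDocs=44218) = 1 + ln(44218 / (20772 + 1))
      query_norm      = 0.047586527
      field_norm      = 0.0390625     # length normalization of the text field in doc 1107
      freq            = 14.0          # occurrences of "information"

      query_weight  = idf_information * query_norm           # ~0.083537094
      tf            = math.sqrt(freq)                         # ~3.7416575
      field_weight  = tf * idf_information * field_norm       # ~0.256578
      w_information = query_weight * field_weight             # ~0.02143378

      w_rest = 0.055556197            # the "technology" + "22" sub-sum from the explanation
      total  = (w_information + w_rest) * 0.5                 # coord(2/4) -> ~0.03849499, shown as 0.04
      print(round(w_information, 8), round(total, 8))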
    
    Abstract
    Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
    Date
    28.10.2013 19:22:57
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2265-2277
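    The PETC idea above (extract the passages most relevant to the candidate categories and let the underlying classifier judge those instead of the full, noisy text) can be illustrated with a small stand-in. The sketch below is not the paper's extractor: it simply trains a linear SVM on whole documents and, at prediction time, scores sentence windows and keeps the one the classifier is most confident about. The training texts, labels, and window size are illustrative assumptions.
      # Simplified stand-in for passage-based classification; not the actual PETC algorithm.
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      train_texts  = ["aspirin and rest are the usual treatment for mild headache",
                      "typical symptoms include fever, cough and fatigue"]
      train_labels = ["treatment", "symptoms"]

      vec = TfidfVectorizer()
      clf = LinearSVC().fit(vec.fit_transform(train_texts), train_labels)

      def classify_best_passage(text, window=2):
          sents = [s.strip() for s in text.split(".") if s.strip()]
          passages = [". ".join(sents[i:i + window]) for i in range(max(1, len(sents) - window + 1))]
          scores = clf.decision_function(vec.transform(passages))
          conf = np.abs(scores) if scores.ndim == 1 else np.abs(scores).max(axis=1)
          best = passages[int(conf.argmax())]            # passage the classifier is most sure about
          return clf.predict(vec.transform([best]))[0]

      print(classify_best_passage("The patient reported cough and fever. Aspirin was given for the pain."))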
  2. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.04
    0.038194444 = product of:
      0.07638889 = sum of:
        0.00972145 = weight(_text_:information in 690) [ClassicSimilarity], result of:
          0.00972145 = score(doc=690,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.116372846 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
        0.06666744 = sum of:
          0.027983533 = weight(_text_:technology in 690) [ClassicSimilarity], result of:
            0.027983533 = score(doc=690,freq=2.0), product of:
              0.1417311 = queryWeight, product of:
                2.978387 = idf(docFreq=6114, maxDocs=44218)
                0.047586527 = queryNorm
              0.19744103 = fieldWeight in 690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.978387 = idf(docFreq=6114, maxDocs=44218)
                0.046875 = fieldNorm(doc=690)
          0.038683902 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
            0.038683902 = score(doc=690,freq=2.0), product of:
              0.16663991 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047586527 = queryNorm
              0.23214069 = fieldWeight in 690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=690)
      0.5 = coord(2/4)
    
    Date
    23. 3.2013 13:22:36
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.4, S.844-860
  3. Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.04
    0.038194444 = product of:
      0.07638889 = sum of:
        0.00972145 = weight(_text_:information in 2158) [ClassicSimilarity], result of:
          0.00972145 = score(doc=2158,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.116372846 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
        0.06666744 = sum of:
          0.027983533 = weight(_text_:technology in 2158) [ClassicSimilarity], result of:
            0.027983533 = score(doc=2158,freq=2.0), product of:
              0.1417311 = queryWeight, product of:
                2.978387 = idf(docFreq=6114, maxDocs=44218)
                0.047586527 = queryNorm
              0.19744103 = fieldWeight in 2158, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.978387 = idf(docFreq=6114, maxDocs=44218)
                0.046875 = fieldNorm(doc=2158)
          0.038683902 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
            0.038683902 = score(doc=2158,freq=2.0), product of:
              0.16663991 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.047586527 = queryNorm
              0.23214069 = fieldWeight in 2158, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=2158)
      0.5 = coord(2/4)
    
    Date
    4. 8.2015 19:22:04
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.9, S.1817-1831
  4. Ko, Y.: A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.02
    0.020744089 = product of:
      0.041488178 = sum of:
        0.02749641 = weight(_text_:information in 2339) [ClassicSimilarity], result of:
          0.02749641 = score(doc=2339,freq=16.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.3291521 = fieldWeight in 2339, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 2339) [ClassicSimilarity], result of:
              0.027983533 = score(doc=2339,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 2339, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2339)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2553-2565
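    The scheme above weights a term by combining its frequency with the odds of its positive and negative class probabilities rather than with idf alone. The exact definition of log tf-TRR is given in the article; the sketch below only illustrates the general idea with an assumed form, log(1 + tf) scaled by the smoothed ratio of a term's relative frequency in positive versus negative training documents.
      # Assumed illustration of a class-odds term weight; not the article's exact log tf-TRR formula.
      import math
      from collections import Counter

      pos_docs = [["market", "price", "rises"], ["market", "gains", "strongly"]]
      neg_docs = [["rain", "all", "weekend"], ["market", "closed", "rain"]]

      def counts(docs):
          c = Counter()
          for d in docs:
              c.update(d)
          return c, sum(len(d) for d in docs)

      pos_c, pos_n = counts(pos_docs)
      neg_c, neg_n = counts(neg_docs)

      def weight(term, tf):
          p_pos = (pos_c[term] + 1) / (pos_n + 1)      # add-one smoothing
          p_neg = (neg_c[term] + 1) / (neg_n + 1)
          return math.log(1 + tf) * (p_pos / p_neg)    # frequency part times class-odds part

      print(weight("market", 3), weight("rain", 3))    # terms typical of the positive class get larger weights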
  5. Chae, G.; Park, J.; Park, J.; Yeo, W.S.; Shi, C.: Linking and clustering artworks using social tags : revitalizing crowd-sourced information on cultural collections (2016) 0.01
    0.01393111 = product of:
      0.02786222 = sum of:
        0.016202414 = weight(_text_:information in 2852) [ClassicSimilarity], result of:
          0.016202414 = score(doc=2852,freq=8.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.19395474 = fieldWeight in 2852, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2852)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 2852) [ClassicSimilarity], result of:
              0.02331961 = score(doc=2852,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 2852, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2852)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Social tagging is one of the most popular methods for collecting crowd-sourced information in galleries, libraries, archives, and museums (GLAMs). However, when the number of social tags grows rapidly, using them becomes problematic and, as a result, they are often left as simply big data that cannot be used for practical purposes. To revitalize the use of this crowd-sourced information, we propose using social tags to link and cluster artworks based on an experimental study using an online collection at the Gyeonggi Museum of Modern Art (GMoMA). We view social tagging as a folksonomy, where artworks are classified by keywords of the crowd's various interpretations and one artwork can belong to several different categories simultaneously. To leverage this strength of social tags, we used a clustering method called "link communities" to detect overlapping communities in a network of artworks constructed by computing similarities between all artwork pairs. We used this framework to identify semantic relationships and clusters of similar artworks. By comparing the clustering results with curators' manual classification results, we demonstrated the potential of social tagging data for automatically clustering artworks in a way that reflects the dynamic perspectives of crowds.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.4, S.885-899
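    The clustering described above starts from a network in which artworks are linked according to the similarity of their social-tag sets; the link-communities algorithm then detects overlapping groups in that network. Below is a minimal sketch of the network-construction step only. The artworks, tags, and threshold are illustrative assumptions, and the overlapping-community detection itself is not reproduced.
      # Build a tag-similarity network between artworks; input for a link-communities step.
      from itertools import combinations

      artwork_tags = {
          "A": {"portrait", "blue", "melancholy"},
          "B": {"portrait", "blue", "oil"},
          "C": {"landscape", "mountain", "oil"},
      }

      def jaccard(a, b):
          return len(a & b) / len(a | b)

      edges = []
      for (x, tx), (y, ty) in combinations(artwork_tags.items(), 2):
          sim = jaccard(tx, ty)
          if sim >= 0.2:                       # illustrative cut-off
              edges.append((x, y, round(sim, 3)))

      print(edges)                             # [('A', 'B', 0.5), ('B', 'C', 0.2)]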
  6. Liu, R.-L.: Context-based term frequency assessment for text classification (2010) 0.01
    0.013869986 = product of:
      0.027739972 = sum of:
        0.013748205 = weight(_text_:information in 3331) [ClassicSimilarity], result of:
          0.013748205 = score(doc=3331,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.16457605 = fieldWeight in 3331, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=3331)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 3331) [ClassicSimilarity], result of:
              0.027983533 = score(doc=3331,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 3331, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3331)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Automatic text classification (TC) is essential for the management of information. To properly classify a document d, it is essential to identify the semantics of each term t in d, while the semantics heavily depend on context (neighboring terms) of t in d. Therefore, we present a technique CTFA (Context-based Term Frequency Assessment) that improves text classifiers by considering term contexts in test documents. The results of the term context recognition are used to assess term frequencies of terms, and hence CTFA may easily work with various kinds of text classifiers that base their TC decisions on term frequencies, without needing to modify the classifiers. Moreover, CTFA is efficient, and neither huge memory nor domain-specific knowledge is required. Empirical results show that CTFA successfully enhances performance of several kinds of text classifiers on different experimental data.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.2, S.300-309
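    CTFA, as described above, re-assesses term frequencies in a test document according to the context in which each occurrence appears, and hands the adjusted frequencies to an ordinary frequency-based classifier. The sketch below is only a hypothetical reading of that idea (the context vocabulary, window size, and bonus rule are assumptions, not the published procedure): an occurrence counts for more when terms known to co-occur with the category appear nearby.
      # Hypothetical context-weighted term frequency; not the published CTFA procedure.
      def context_weighted_tf(tokens, term, context_terms, window=3, bonus=1.0):
          tf = 0.0
          for i, tok in enumerate(tokens):
              if tok != term:
                  continue
              nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
              support = sum(1 for n in nearby if n in context_terms)
              tf += 1.0 + bonus * support        # plain count plus a context bonus
          return tf

      doc = "the virus causes fever and the virus spreads fast".split()
      print(context_weighted_tf(doc, "virus", {"fever", "causes", "infection"}))   # 5.0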
  7. Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.01
    0.013869986 = product of:
      0.027739972 = sum of:
        0.013748205 = weight(_text_:information in 3464) [ClassicSimilarity], result of:
          0.013748205 = score(doc=3464,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.16457605 = fieldWeight in 3464, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=3464)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 3464) [ClassicSimilarity], result of:
              0.027983533 = score(doc=3464,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 3464, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3464)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and they are employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods, and they were cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.6, S.1105-1119
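    The hybrid clustering above fuses a text-based similarity with a bibliometric (citation-based) similarity before clustering, with weights controlling how much each data source contributes. A minimal sketch under assumed data: two toy journal-by-journal similarity matrices, a fixed weight, and agglomerative clustering standing in for the ensemble and kernel-fusion algorithms evaluated in the article.
      # Weighted fusion of two similarity matrices, then clustering on the fused distances.
      import numpy as np
      from sklearn.cluster import AgglomerativeClustering

      S_text = np.array([[1.0, 0.8, 0.1, 0.0],      # assumed text-based similarities (4 journals)
                         [0.8, 1.0, 0.2, 0.1],
                         [0.1, 0.2, 1.0, 0.7],
                         [0.0, 0.1, 0.7, 1.0]])
      S_cite = np.array([[1.0, 0.6, 0.2, 0.1],      # assumed citation-based similarities
                         [0.6, 1.0, 0.1, 0.0],
                         [0.2, 0.1, 1.0, 0.9],
                         [0.1, 0.0, 0.9, 1.0]])

      w = 0.5                                        # illustrative weight between the two sources
      S = w * S_text + (1 - w) * S_cite              # fused similarity
      D = 1 - S                                      # similarity -> distance

      labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                       linkage="average").fit_predict(D)
      print(labels)                                  # two journal clusters, e.g. [0 0 1 1]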
  8. Cortez, E.; Herrera, M.R.; Silva, A.S. da; Moura, E.S. de; Neubert, M.: Lightweight methods for large-scale product categorization (2011) 0.01
    0.013869986 = product of:
      0.027739972 = sum of:
        0.013748205 = weight(_text_:information in 4758) [ClassicSimilarity], result of:
          0.013748205 = score(doc=4758,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.16457605 = fieldWeight in 4758, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=4758)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 4758) [ClassicSimilarity], result of:
              0.027983533 = score(doc=4758,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 4758, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4758)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    In this article, we present a study about classification methods for large-scale categorization of product offers on e-shopping web sites. We present a study about the performance of previously proposed approaches and deployed a probabilistic approach to model the classification problem. We also studied an alternative way of modeling information about the description of product offers and investigated the usage of price and store of product offers as features adopted in the classification process. Our experiments used two collections of over a million product offers previously categorized by human editors and taxonomies of hundreds of categories from a real e-shopping web site. In these experiments, our method achieved an improvement of up to 9% in the quality of the categorization in comparison with the best baseline we have found.
    Source
    Journal of the American Society for Information Science and Technology. 62(2011) no.9, S.1839-1848
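    The study above models product categorization probabilistically from the offer description and additionally investigates price and store as features. The sketch below shows one simple way to combine such signals (multinomial Naive Bayes over description tokens plus a bucketed price and a store token); the encoding and data are assumptions, not the article's exact model.
      # Illustrative combination of description, price bucket and store for offer categorization.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB

      offers = [
          {"desc": "usb flash drive 64gb", "price": 12.0, "store": "techmart"},
          {"desc": "running shoes size 42", "price": 80.0, "store": "sportworld"},
          {"desc": "usb cable two meters", "price": 5.0,  "store": "techmart"},
      ]
      labels = ["electronics", "sports", "electronics"]

      def to_tokens(offer):
          bucket = "price_low" if offer["price"] < 20 else "price_high"    # assumed bucketing
          return offer["desc"] + " " + bucket + " store_" + offer["store"]

      vec = CountVectorizer()
      X = vec.fit_transform(to_tokens(o) for o in offers)
      clf = MultinomialNB().fit(X, labels)

      new_offer = {"desc": "usb charger", "price": 9.0, "store": "techmart"}
      print(clf.predict(vec.transform([to_tokens(new_offer)]))[0])         # likely "electronics"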
  9. Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.01
    0.013869986 = product of:
      0.027739972 = sum of:
        0.013748205 = weight(_text_:information in 4948) [ClassicSimilarity], result of:
          0.013748205 = score(doc=4948,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.16457605 = fieldWeight in 4948, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=4948)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 4948) [ClassicSimilarity], result of:
              0.027983533 = score(doc=4948,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 4948, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4948)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
    Source
    Journal of the American Society for Information Science and Technology. 62(2011) no.12, S.2496-2511
  10. Mu, T.; Goulermas, J.Y.; Korkontzelos, I.; Ananiadou, S.: Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities (2016) 0.01
    0.012845755 = product of:
      0.02569151 = sum of:
        0.0140317045 = weight(_text_:information in 2496) [ClassicSimilarity], result of:
          0.0140317045 = score(doc=2496,freq=6.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.16796975 = fieldWeight in 2496, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2496)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 2496) [ClassicSimilarity], result of:
              0.02331961 = score(doc=2496,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 2496, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2496)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases using a ranking scheme using multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL is demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.1, S.106-133
  11. Golub, K.; Soergel, D.; Buchanan, G.; Tudhope, D.; Lykke, M.; Hiom, D.: A framework for evaluating automatic indexing or classification in the context of retrieval (2016) 0.01
    0.012845755 = product of:
      0.02569151 = sum of:
        0.0140317045 = weight(_text_:information in 3311) [ClassicSimilarity], result of:
          0.0140317045 = score(doc=3311,freq=6.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.16796975 = fieldWeight in 3311, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3311)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 3311) [ClassicSimilarity], result of:
              0.02331961 = score(doc=3311,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 3311, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3311)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. Although some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The article reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.
    Series
    Advances in information science
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.1, S.3-16
  12. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.01
    0.012295332 = product of:
      0.024590664 = sum of:
        0.008101207 = weight(_text_:information in 4101) [ClassicSimilarity], result of:
          0.008101207 = score(doc=4101,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.09697737 = fieldWeight in 4101, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4101)
        0.016489455 = product of:
          0.03297891 = sum of:
            0.03297891 = weight(_text_:technology in 4101) [ClassicSimilarity], result of:
              0.03297891 = score(doc=4101,freq=4.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.23268649 = fieldWeight in 4101, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4101)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Hierarchical text classification (HTC) approaches have recently attracted a lot of interest on the part of researchers in human language technology and machine learning, since they have been shown to bring about equal, if not better, classification accuracy with respect to their "flat" counterparts while allowing exponential time savings at both learning and classification time. A typical component of HTC methods is a "local" policy for selecting negative examples: Given a category c, its negative training examples are by default identified with the training examples that are negative for c and positive for the categories which are siblings of c in the hierarchy. However, this policy has always been taken for granted and never been subjected to careful scrutiny since first proposed 15 years ago. This article proposes a thorough experimental comparison between this policy and three other policies for the selection of negative examples in HTC contexts, one of which (BEST LOCAL (k)) is being proposed for the first time in this article. We compare these policies on the hierarchical versions of three supervised learning algorithms (boosting, support vector machines, and naïve Bayes) by performing experiments on two standard TC datasets, REUTERS-21578 and RCV1-V2.
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.11, S.2256-2265
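    The default "local" policy discussed above takes, as negative training examples for a category c, the documents that are negative for c but positive for at least one sibling of c in the hierarchy. A minimal sketch of that selection follows; the toy hierarchy and document labels are assumptions, and the alternative policies compared in the article, such as BEST LOCAL(k), are not shown.
      # Default "siblings" policy for choosing negative training examples in hierarchical TC.
      parent = {"cats": "pets", "dogs": "pets", "stocks": "finance", "bonds": "finance"}

      doc_labels = {
          "d1": {"cats"},
          "d2": {"dogs"},
          "d3": {"dogs", "cats"},
          "d4": {"stocks"},
      }

      def siblings(c):
          return {k for k, p in parent.items() if p == parent[c] and k != c}

      def local_negatives(c):
          sib = siblings(c)
          return {d for d, labels in doc_labels.items()
                  if c not in labels and labels & sib}   # negative for c, positive for a sibling

      print(local_negatives("cats"))   # {'d2'}: d3 is positive for cats, d4 sits in another branch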
  13. Schaalje, G.B.; Blades, N.J.; Funai, T.: An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.01
    0.011856608 = product of:
      0.023713216 = sum of:
        0.00972145 = weight(_text_:information in 1041) [ClassicSimilarity], result of:
          0.00972145 = score(doc=1041,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.116372846 = fieldWeight in 1041, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1041)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 1041) [ClassicSimilarity], result of:
              0.027983533 = score(doc=1041,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 1041, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1041)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.9, S.1815-1825
  14. Aphinyanaphongs, Y.; Fu, L.D.; Li, Z.; Peskin, E.R.; Efstathiadis, E.; Aliferis, C.F.; Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization (2014) 0.01
    0.011856608 = product of:
      0.023713216 = sum of:
        0.00972145 = weight(_text_:information in 1496) [ClassicSimilarity], result of:
          0.00972145 = score(doc=1496,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.116372846 = fieldWeight in 1496, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1496)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 1496) [ClassicSimilarity], result of:
              0.027983533 = score(doc=1496,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 1496, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1496)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Source
    Journal of the Association for Information Science and Technology. 65(2014) no.10, S.1964-1987
  15. Barbu, E.: What kind of knowledge is in Wikipedia? : unsupervised extraction of properties for similar concepts (2014) 0.01
    0.011856608 = product of:
      0.023713216 = sum of:
        0.00972145 = weight(_text_:information in 1547) [ClassicSimilarity], result of:
          0.00972145 = score(doc=1547,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.116372846 = fieldWeight in 1547, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=1547)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 1547) [ClassicSimilarity], result of:
              0.027983533 = score(doc=1547,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 1547, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1547)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Source
    Journal of the Association for Information Science and Technology. 65(2014) no.12, S.2489-2497
  16. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.01
    0.011856608 = product of:
      0.023713216 = sum of:
        0.00972145 = weight(_text_:information in 3015) [ClassicSimilarity], result of:
          0.00972145 = score(doc=3015,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.116372846 = fieldWeight in 3015, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.046875 = fieldNorm(doc=3015)
        0.013991767 = product of:
          0.027983533 = sum of:
            0.027983533 = weight(_text_:technology in 3015) [ClassicSimilarity], result of:
              0.027983533 = score(doc=3015,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.19744103 = fieldWeight in 3015, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3015)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.7, S.1668-1678
  17. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.01
    0.011558321 = product of:
      0.023116643 = sum of:
        0.011456838 = weight(_text_:information in 4775) [ClassicSimilarity], result of:
          0.011456838 = score(doc=4775,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.13714671 = fieldWeight in 4775, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4775)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 4775) [ClassicSimilarity], result of:
              0.02331961 = score(doc=4775,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 4775, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4775)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    The progressive increase of information content has recently made it necessary to create a system for automatic classification of documents. In this article, a system is presented for the categorization of multiclass Farsi documents that requires fewer training examples and can help to compensate the shortcoming of the standard training dataset. The new idea proposed in the present article is based on extending the feature vector by adding some words extracted from a thesaurus and then filtering the new feature vector by applying secondary feature selection to discard inappropriate features. In fact, a phase of secondary feature selection is applied to choose more appropriate features among the features added from a thesaurus to enhance the effect of using a thesaurus on the efficiency of the classifier. To evaluate the proposed system, a corpus is gathered from the Farsi Wikipedia website and some articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to studying the role of a thesaurus and applying secondary feature selection, the effect of a various number of categories, size of the training dataset, and average number of words in the test data also are examined. As the results indicate, classification efficiency improves by applying this approach, especially when available data is not sufficient for some text categories.
    Source
    Journal of the American Society for Information Science and Technology. 62(2011) no.10, S.2055-2066
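    The approach above first extends each document's feature vector with terms drawn from a thesaurus and then applies a secondary feature-selection step to discard the unhelpful additions. A small sketch of that two-stage idea follows; a toy English synonym table and chi-square selection stand in for the Farsi thesaurus and the selection criteria used in the article.
      # Thesaurus expansion (stage 1) followed by secondary feature selection (stage 2).
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, chi2

      thesaurus = {"film": ["movie", "cinema"], "football": ["soccer"]}    # assumed synonym table

      def expand(text):
          tokens = text.split()
          extra = [syn for t in tokens for syn in thesaurus.get(t, [])]
          return " ".join(tokens + extra)

      docs   = ["film review tonight", "football match result", "new film released"]
      labels = ["culture", "sport", "culture"]

      vec = CountVectorizer()
      X = vec.fit_transform(expand(d) for d in docs)                # expanded feature space
      selector = SelectKBest(chi2, k=5).fit(X, labels)              # keep only the 5 best features
      kept = [f for f, keep in zip(vec.get_feature_names_out(), selector.get_support()) if keep]
      print(kept)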
  18. Ma, Z.; Sun, A.; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter (2013) 0.01
    0.011558321 = product of:
      0.023116643 = sum of:
        0.011456838 = weight(_text_:information in 967) [ClassicSimilarity], result of:
          0.011456838 = score(doc=967,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.13714671 = fieldWeight in 967, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=967)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 967) [ClassicSimilarity], result of:
              0.02331961 = score(doc=967,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 967, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=967)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Because of Twitter's popularity and the viral nature of information dissemination on Twitter, predicting which Twitter topics will become popular in the near future becomes a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., Naïve bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter data set consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs the best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features.
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.7, S.1399-1410
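    The task above is framed as classification over content features of the hashtag and its tweets plus contextual features of the adopting users' social graph, with logistic regression performing best among the five classifiers compared. A minimal sketch with assumed toy features; the article's 7 content and 11 contextual features are replaced here by invented columns.
      # Illustrative popularity classification; the feature columns are assumptions, not the article's.
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # columns: hashtag length, digit count (content) | early adopter count, adopter-graph density (contextual)
      X = np.array([[ 8, 0, 120, 0.30],
                    [15, 2,   5, 0.02],
                    [ 6, 0, 300, 0.45],
                    [20, 4,   8, 0.01]])
      y = np.array([1, 0, 1, 0])                     # 1 = hashtag became popular, 0 = it did not

      clf = LogisticRegression().fit(X, y)
      new_tag = [[7, 0, 150, 0.25]]
      print(clf.predict(new_tag)[0], clf.predict_proba(new_tag)[0, 1])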
  19. Vilares, D.; Alonso, M.A.; Gómez-Rodríguez, C.: On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages (2015) 0.01
    0.011558321 = product of:
      0.023116643 = sum of:
        0.011456838 = weight(_text_:information in 2161) [ClassicSimilarity], result of:
          0.011456838 = score(doc=2161,freq=4.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.13714671 = fieldWeight in 2161, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2161)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 2161) [ClassicSimilarity], result of:
              0.02331961 = score(doc=2161,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 2161, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2161)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Abstract
    Millions of micro texts are published every day on Twitter. Identifying the sentiment present in them can be helpful for measuring the frame of mind of the public, their satisfaction with respect to a product, or their support of a social event. In this context, polarity classification is a subfield of sentiment analysis focused on determining whether the content of a text is objective or subjective, and in the latter case, if it conveys a positive or a negative opinion. Most polarity detection techniques tend to take into account individual terms in the text and even some degree of linguistic knowledge, but they do not usually consider syntactic relations between words. This article explores how relating lexical, syntactic, and psychometric information can be helpful to perform polarity classification on Spanish tweets. We provide an evaluation for both shallow and deep linguistic perspectives. Empirical results show an improved performance of syntactic approaches over pure lexical models when using large training sets to create a classifier, but this tendency is reversed when small training collections are used.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.9, S.1799-1816
  20. Kishida, K.: High-speed rough clustering for very large document collections (2010) 0.01
    0.0098805055 = product of:
      0.019761011 = sum of:
        0.008101207 = weight(_text_:information in 3463) [ClassicSimilarity], result of:
          0.008101207 = score(doc=3463,freq=2.0), product of:
            0.083537094 = queryWeight, product of:
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.047586527 = queryNorm
            0.09697737 = fieldWeight in 3463, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.7554779 = idf(docFreq=20772, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3463)
        0.011659805 = product of:
          0.02331961 = sum of:
            0.02331961 = weight(_text_:technology in 3463) [ClassicSimilarity], result of:
              0.02331961 = score(doc=3463,freq=2.0), product of:
                0.1417311 = queryWeight, product of:
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.047586527 = queryNorm
                0.16453418 = fieldWeight in 3463, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.978387 = idf(docFreq=6114, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3463)
          0.5 = coord(1/2)
      0.5 = coord(2/4)
    
    Source
    Journal of the American Society for Information Science and Technology. 61(2010) no.6, S.1092-1104