Search (2261 results, page 1 of 114)

  • year_i:[2010 TO 2020}
  1. Liu, R.-L.: Context-based term frequency assessment for text classification (2010) 0.47
    0.4729282 = sum of:
      0.0122329 = product of:
        0.0489316 = sum of:
          0.0489316 = weight(_text_:based in 3331) [ClassicSimilarity], result of:
            0.0489316 = score(doc=3331,freq=6.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.34595144 = fieldWeight in 3331, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=3331)
        0.25 = coord(1/4)
      0.1916339 = weight(_text_:term in 3331) [ClassicSimilarity], result of:
        0.1916339 = score(doc=3331,freq=16.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.8748806 = fieldWeight in 3331, product of:
            4.0 = tf(freq=16.0), with freq of:
              16.0 = termFreq=16.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=3331)
      0.18691254 = weight(_text_:frequency in 3331) [ClassicSimilarity], result of:
        0.18691254 = score(doc=3331,freq=6.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.6761447 = fieldWeight in 3331, product of:
            2.4494898 = tf(freq=6.0), with freq of:
              6.0 = termFreq=6.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=3331)
      0.08214887 = product of:
        0.16429774 = sum of:
          0.16429774 = weight(_text_:assessment in 3331) [ClassicSimilarity], result of:
            0.16429774 = score(doc=3331,freq=6.0), product of:
              0.25917634 = queryWeight, product of:
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.04694356 = queryNorm
              0.63392264 = fieldWeight in 3331, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.046875 = fieldNorm(doc=3331)
        0.5 = coord(1/2)
    
    Abstract
    Automatic text classification (TC) is essential for the management of information. To properly classify a document d, it is essential to identify the semantics of each term t in d, and those semantics heavily depend on the context (neighboring terms) of t in d. Therefore, we present a technique, CTFA (Context-based Term Frequency Assessment), that improves text classifiers by considering term contexts in test documents. The results of term context recognition are used to assess the frequencies of terms, and hence CTFA may easily work with various kinds of text classifiers that base their TC decisions on term frequencies, without needing to modify the classifiers. Moreover, CTFA is efficient, and neither huge memory nor domain-specific knowledge is required. Empirical results show that CTFA successfully enhances the performance of several kinds of text classifiers on different experimental data.
    Object
    Context-based Term Frequency Assessment
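
The nested score breakdowns in these results are Lucene "explain" output for ClassicSimilarity (a TF-IDF model): each leaf weight is queryWeight x fieldWeight, with idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq), queryWeight = idf x queryNorm, and fieldWeight = tf x idf x fieldNorm. A minimal Python sketch that reproduces the `_text_:based` leaf of result 1 from the values listed above (the function name and rounding are ours, not part of the search engine):

```python
import math

def classic_similarity_leaf(freq, doc_freq, max_docs, query_norm, field_norm):
    """Recompute one leaf of a Lucene ClassicSimilarity explain tree."""
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # idf(docFreq, maxDocs)
    tf = math.sqrt(freq)                             # tf(freq)
    query_weight = idf * query_norm                  # queryWeight
    field_weight = tf * idf * field_norm             # fieldWeight
    return query_weight * field_weight               # leaf score

# weight(_text_:based in 3331): freq=6.0, docFreq=5906, maxDocs=44218
print(round(classic_similarity_leaf(6.0, 5906, 44218, 0.04694356, 0.046875), 7))
# -> approximately 0.0489316, matching the explain output above
```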
  2. Efron, M.: Linear time series models for term weighting in information retrieval (2010) 0.29
    0.29163787 = product of:
      0.38885048 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 3688) [ClassicSimilarity], result of:
              0.028250674 = score(doc=3688,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 3688, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3688)
          0.25 = coord(1/4)
        0.1659598 = weight(_text_:term in 3688) [ClassicSimilarity], result of:
          0.1659598 = score(doc=3688,freq=12.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7576688 = fieldWeight in 3688, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=3688)
        0.215828 = weight(_text_:frequency in 3688) [ClassicSimilarity], result of:
          0.215828 = score(doc=3688,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.7807447 = fieldWeight in 3688, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=3688)
      0.75 = coord(3/4)
    
    Abstract
    Common measures of term importance in information retrieval (IR) rely on counts of term frequency; rare terms receive higher weight in document ranking than common terms do. However, realistic scenarios yield additional information about terms in a collection. Of interest in this article is the temporal behavior of terms as a collection changes over time. We propose capturing each term's collection frequency at discrete time intervals over the lifespan of a corpus and analyzing the resulting time series. We hypothesize that the collection frequency of a weakly discriminative term x at time t is predictable by a linear model of the term's prior observations. On the other hand, a linear time series model for a strong discriminator's collection frequency will yield a poor fit to the data. Operationalizing this hypothesis, we induce three time-based measures of term importance and test these against state-of-the-art term weighting models.
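
A minimal sketch of the idea described above: fit a linear (autoregressive) model to a term's collection-frequency time series and use the in-sample fit error as a signal, so that bursty, discriminative terms yield larger errors. This is only an illustration of the general approach; the toy series, the AR order, and the function name are assumptions, not the paper's actual models or data.

```python
import numpy as np

def ar_fit_error(series, order=2):
    """Least-squares fit of a linear autoregressive model of the given order;
    returns the mean squared in-sample residual (larger = poorer linear fit)."""
    y = np.asarray(series, dtype=float)
    lags = np.column_stack([y[i:len(y) - order + i] for i in range(order)])
    X = np.column_stack([np.ones(len(lags)), lags])   # intercept + lagged values
    target = y[order:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((target - X @ coef) ** 2))

# Toy collection-frequency series sampled at discrete time intervals (invented).
common_term = [120, 118, 123, 121, 119, 122, 120, 121]   # smooth, predictable
bursty_term = [2, 0, 1, 45, 3, 0, 52, 1]                  # bursty discriminator
print(ar_fit_error(common_term), ar_fit_error(bursty_term))
```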
  3. Zanibbi, R.; Yuan, B.: Keyword and image-based retrieval for mathematical expressions (2011) 0.25
    0.24943498 = sum of:
      0.009988121 = product of:
        0.039952483 = sum of:
          0.039952483 = weight(_text_:based in 3449) [ClassicSimilarity], result of:
            0.039952483 = score(doc=3449,freq=4.0), product of:
              0.14144066 = queryWeight, product of:
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.04694356 = queryNorm
              0.28246817 = fieldWeight in 3449, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.0129938 = idf(docFreq=5906, maxDocs=44218)
                0.046875 = fieldNorm(doc=3449)
        0.25 = coord(1/4)
      0.06775281 = weight(_text_:term in 3449) [ClassicSimilarity], result of:
        0.06775281 = score(doc=3449,freq=2.0), product of:
          0.21904005 = queryWeight, product of:
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.04694356 = queryNorm
          0.309317 = fieldWeight in 3449, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            4.66603 = idf(docFreq=1130, maxDocs=44218)
            0.046875 = fieldNorm(doc=3449)
      0.15261345 = weight(_text_:frequency in 3449) [ClassicSimilarity], result of:
        0.15261345 = score(doc=3449,freq=4.0), product of:
          0.27643865 = queryWeight, product of:
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.04694356 = queryNorm
          0.55206984 = fieldWeight in 3449, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            5.888745 = idf(docFreq=332, maxDocs=44218)
            0.046875 = fieldNorm(doc=3449)
      0.019080611 = product of:
        0.038161222 = sum of:
          0.038161222 = weight(_text_:22 in 3449) [ClassicSimilarity], result of:
            0.038161222 = score(doc=3449,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.23214069 = fieldWeight in 3449, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=3449)
        0.5 = coord(1/2)
    
    Abstract
    Two new methods for retrieving mathematical expressions using conventional keyword search and expression images are presented. An expression-level TF-IDF (term frequency-inverse document frequency) approach is used for keyword search, where queries and indexed expressions are represented by keywords taken from LaTeX strings. TF-IDF is computed at the level of individual expressions rather than documents to increase the precision of matching. The second retrieval technique is a form of Content-Based Image Retrieval (CBIR). Expressions are segmented into connected components, and then components in the query expression and each expression in the collection are matched using contour and density features, aspect ratios, and relative positions. In an experiment using ten randomly sampled queries from a corpus of over 22,000 expressions, precision-at-k (k = 20) for the keyword-based approach was higher (keyword: µ = 84.0, s = 19.0; image-based: µ = 32.0, s = 30.7), but for a few of the queries better results were obtained using a combination of the two techniques.
    Date
    22. 2.2017 12:53:49
  4. Nunes, S.; Ribeiro, C.; David, G.: Term weighting based on document revision history (2011) 0.24
    0.2428341 = product of:
      0.3237788 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 4946) [ClassicSimilarity], result of:
              0.033293735 = score(doc=4946,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 4946, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4946)
          0.25 = coord(1/4)
        0.15969492 = weight(_text_:term in 4946) [ClassicSimilarity], result of:
          0.15969492 = score(doc=4946,freq=16.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7290672 = fieldWeight in 4946, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4946)
        0.15576044 = weight(_text_:frequency in 4946) [ClassicSimilarity], result of:
          0.15576044 = score(doc=4946,freq=6.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.5634539 = fieldWeight in 4946, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4946)
      0.75 = coord(3/4)
    
    Abstract
    In real-world information retrieval systems, the underlying document collection is rarely stable or definitive. This work is focused on the study of signals extracted from the content of documents at different points in time for the purpose of weighting individual terms in a document. The basic idea behind our proposals is that terms that have existed for a longer time in a document should have a greater weight. We propose 4 term weighting functions that use each document's history to estimate a current term score. To evaluate this thesis, we conduct 3 independent experiments using a collection of documents sampled from Wikipedia. In the first experiment, we use data from Wikipedia to judge each set of terms. In a second experiment, we use an external collection of tags from a popular social bookmarking service as a gold standard. In the third experiment, we crowdsource user judgments to collect feedback on term preference. Across all experiments, results consistently support our thesis. We show that temporally aware measures, specifically the proposed revision term frequency and revision term frequency span, outperform a term-weighting measure based on raw term frequency alone.
  5. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.23
    0.23006546 = product of:
      0.30675396 = sum of:
        0.010194084 = product of:
          0.040776335 = sum of:
            0.040776335 = weight(_text_:based in 1283) [ClassicSimilarity], result of:
              0.040776335 = score(doc=1283,freq=6.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28829288 = fieldWeight in 1283, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1283)
          0.25 = coord(1/4)
        0.16938202 = weight(_text_:term in 1283) [ClassicSimilarity], result of:
          0.16938202 = score(doc=1283,freq=18.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7732925 = fieldWeight in 1283, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
        0.12717786 = weight(_text_:frequency in 1283) [ClassicSimilarity], result of:
          0.12717786 = score(doc=1283,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 1283, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
      0.75 = coord(3/4)
    
    Abstract
    While term independence is a widely held assumption in most of the established information retrieval approaches, it is clearly not true and various works in the past have investigated a relaxation of the assumption. One approach is to use n-grams in document representation instead of unigrams. However, the majority of early works on n-grams obtained only modest performance improvement. On the other hand, the use of information based on supporting terms or "contexts" of queries has been found to be promising. In particular, recent studies showed that using new context-dependent term weights improved the performance of relevance feedback (RF) retrieval compared with using traditional bag-of-words BM25 term weights. Calculation of the new term weights requires an estimation of the local probability of relevance of each query term occurrence. In previous studies, the estimation of this probability was based on unigrams that occur in the neighborhood of a query term. We explore an integration of the n-gram and context approaches by computing context-dependent term weights based on a mixture of unigrams and bigrams. Extensive experiments are performed using the title queries of the Text Retrieval Conference (TREC)-6, TREC-7, TREC-8, and TREC-2005 collections, for RF with relevance judgment of either the top 10 or top 20 documents of an initial retrieval. We identify some crucial elements needed in the use of bigrams in our methods, such as proper inverse document frequency (IDF) weighting of the bigrams and noise reduction by pruning bigrams with large document frequency values. We show that enhancing context-dependent term weights with bigrams is effective in further improving retrieval performance.
  6. Shibata, N.; Kajikawa, Y.; Sakata, I.: Link prediction in citation networks (2012) 0.17
    0.17057168 = product of:
      0.22742891 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 4964) [ClassicSimilarity], result of:
              0.028250674 = score(doc=4964,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 4964, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4964)
          0.25 = coord(1/4)
        0.06775281 = weight(_text_:term in 4964) [ClassicSimilarity], result of:
          0.06775281 = score(doc=4964,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 4964, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=4964)
        0.15261345 = weight(_text_:frequency in 4964) [ClassicSimilarity], result of:
          0.15261345 = score(doc=4964,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 4964, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=4964)
      0.75 = coord(3/4)
    
    Abstract
    In this article, we build models to predict the existence of citations among papers by formulating link prediction for 5 large-scale datasets of citation networks. The supervised machine-learning model is applied with 11 features. As a result, our learner performs very well, with F1 values between 0.74 and 0.82. Three features in particular largely affect the predictions of citations: the link-based Jaccard coefficient, the difference in betweenness centrality, and the cosine similarity of term frequency-inverse document frequency vectors. The results also indicate that different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues. In the case of research fields with multiple issues, there are barriers among research fields, because our results indicate that papers tend to be cited locally within each research field. Therefore, one must consider the typology of the targeted research areas when building models for link prediction in citation networks.
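
One of the three influential features named above, the cosine similarity of the TF-IDF vectors of two papers, can be sketched as follows; the vectors and weights are invented for illustration and are not from the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical TF-IDF vectors (term -> weight) of two papers' abstracts.
paper_a = {"citation": 2.1, "network": 1.7, "link": 1.3}
paper_b = {"citation": 1.9, "prediction": 2.4, "link": 0.8}
print(round(cosine(paper_a, paper_b), 3))   # used as one of the 11 features
```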
  7. Doko, A.; Stula, M.; Seric, L.: Improved sentence retrieval using local context and sentence length (2013) 0.17
    0.17057168 = product of:
      0.22742891 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 2705) [ClassicSimilarity], result of:
              0.028250674 = score(doc=2705,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 2705, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2705)
          0.25 = coord(1/4)
        0.06775281 = weight(_text_:term in 2705) [ClassicSimilarity], result of:
          0.06775281 = score(doc=2705,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 2705, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=2705)
        0.15261345 = weight(_text_:frequency in 2705) [ClassicSimilarity], result of:
          0.15261345 = score(doc=2705,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 2705, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=2705)
      0.75 = coord(3/4)
    
    Abstract
    In this paper we propose improved variants of the sentence retrieval method TF-ISF (a TF-IDF, or Term Frequency-Inverse Document Frequency, variant for sentence retrieval). The improvement is achieved by using context consisting of neighboring sentences and at the same time promoting the retrieval of longer sentences. We thoroughly compare the new modified TF-ISF methods to the TF-ISF baseline, to an earlier attempt to include context into TF-ISF named tfmix, and to a language-modeling-based method named 3MMPDS that uses context and promotes the retrieval of long sentences. Experimental results show that the TF-ISF method can be improved using local context. Results also show that the TF-ISF method can be improved by promoting the retrieval of longer sentences. Finally, we show that the best results are achieved when combining both modifications. All new methods (TF-ISF variants) also show statistically significantly better results than the other tested methods.
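
For orientation, a minimal sketch of the TF-ISF baseline that the paper modifies: a sentence is scored by the query terms' frequencies in it, weighted by their inverse sentence frequencies. The log base, smoothing, and the way neighboring-sentence context and sentence length are folded in are the paper's details and are not reproduced here; all names below are ours.

```python
import math
from collections import Counter

def tf_isf(query_terms, sentence, all_sentences):
    """Baseline TF-ISF score of one sentence for a query."""
    n = len(all_sentences)
    counts = Counter(sentence)
    score = 0.0
    for term in set(query_terms):
        sf = sum(1 for s in all_sentences if term in s)   # sentence frequency
        if sf and counts[term]:
            score += counts[term] * math.log(n / sf)      # tf * isf
    return score

sentences = [["term", "weighting", "matters"],
             ["sentence", "retrieval", "with", "context"],
             ["term", "context", "helps"]]
print(tf_isf(["term", "retrieval"], sentences[0], sentences))
```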
  8. Liu, R.-L.; Huang, Y.-C.: Ranker enhancement for proximity-based ranking of biomedical texts (2011) 0.17
    0.16837625 = product of:
      0.22450167 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 4947) [ClassicSimilarity], result of:
              0.033293735 = score(doc=4947,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 4947, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4947)
          0.25 = coord(1/4)
        0.12624991 = weight(_text_:term in 4947) [ClassicSimilarity], result of:
          0.12624991 = score(doc=4947,freq=10.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5763782 = fieldWeight in 4947, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4947)
        0.08992833 = weight(_text_:frequency in 4947) [ClassicSimilarity], result of:
          0.08992833 = score(doc=4947,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 4947, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4947)
      0.75 = coord(3/4)
    
    Abstract
    Biomedical decision making often requires relevant evidence from the biomedical literature. Retrieval of the evidence calls for a system that receives a natural language query for a biomedical information need and, among the huge number of texts retrieved for the query, ranks relevant texts higher for further processing. However, state-of-the-art text rankers have weaknesses in dealing with biomedical queries, which often consist of several correlated concepts and favor texts that cover the concepts completely. In this article, we present a technique, Proximity-Based Ranker Enhancer (PRE), to enhance text rankers with term-proximity information. PRE assesses the term frequency (TF) of each term in the text by integrating three types of term proximity to measure the contextual completeness of query terms appearing in nearby areas of the text being ranked. Therefore, PRE may serve as a preprocessor for (or supplement to) those rankers that consider TF in ranking, without the need to change the algorithms and development processes of the rankers. Empirical evaluation shows that PRE significantly improves various kinds of text rankers, and when compared with several state-of-the-art techniques that enhance rankers with term-proximity information, PRE may more stably and significantly enhance the rankers.
  9. Ko, Y.: ¬A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.17
    0.16593526 = product of:
      0.33187053 = sum of:
        0.17925708 = weight(_text_:term in 2339) [ClassicSimilarity], result of:
          0.17925708 = score(doc=2339,freq=14.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.8183758 = fieldWeight in 2339, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
        0.15261345 = weight(_text_:frequency in 2339) [ClassicSimilarity], result of:
          0.15261345 = score(doc=2339,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 2339, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=2339)
      0.5 = coord(2/4)
    
    Abstract
    Text classification (TC) is a core technique for text mining and information retrieval. It has been applied in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain high TC performance. Although term weighting is one of the important modules for TC, and TC has different peculiarities from information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information via positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than other schemes that use class information, as well as traditional schemes such as tf-idf.
  10. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.16
    0.15837984 = product of:
      0.21117312 = sum of:
        0.008323434 = product of:
          0.033293735 = sum of:
            0.033293735 = weight(_text_:based in 3042) [ClassicSimilarity], result of:
              0.033293735 = score(doc=3042,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.23539014 = fieldWeight in 3042, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3042)
          0.25 = coord(1/4)
        0.11292135 = weight(_text_:term in 3042) [ClassicSimilarity], result of:
          0.11292135 = score(doc=3042,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5155283 = fieldWeight in 3042, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3042)
        0.08992833 = weight(_text_:frequency in 3042) [ClassicSimilarity], result of:
          0.08992833 = score(doc=3042,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 3042, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3042)
      0.75 = coord(3/4)
    
    Abstract
    Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf-idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) has been proposed to define the topics included in a corpus. As another strategy, this study proposes to apply a vocabulary specificity measure (Z-score) to determine the most significantly overused word-types or short sequences of them. Our experiments show that the simple term frequency measure is not able to discriminate between specific terms associated with a document or a set of texts. Using the tf-idf or LDA approach, the selection requires some arbitrary decisions. Based on the term-specific measure (Z-score), the term selection has a clear theoretical basis. Moreover, the most significant sentences for each presidency can be determined. As another facet, we can visualize the dynamic evolution of the usage of some terms associated with their specificity measures. Finally, this technique can be employed to define the most important lexical leaders introducing terms overused by the k following presidencies.
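
The Z-score mentioned above is commonly computed as a binomial specificity measure: the observed frequency of a term in a sub-corpus is compared with the frequency expected from its corpus-wide rate. The sketch below assumes that standard form; whether it matches the paper's exact definition is an assumption, and the counts are invented.

```python
import math

def z_score(count_in_part, part_size, count_in_corpus, corpus_size):
    """Binomial Z-score of a term's specificity for a sub-corpus."""
    p = count_in_corpus / corpus_size          # corpus-wide relative frequency
    expected = part_size * p                   # expected count in the sub-corpus
    variance = part_size * p * (1.0 - p)
    return (count_in_part - expected) / math.sqrt(variance)

# A word used 40 times in one presidency's 10,000 tokens, but only 200 times
# in a 1,000,000-token corpus overall: strongly overused, large positive Z.
print(round(z_score(40, 10_000, 200, 1_000_000), 2))
```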
  11. Li, X.; Zhang, A.; Li, C.; Ouyang, J.; Cai, Y.: Exploring coherent topics by topic modeling with term weighting (2018) 0.16
    0.15655142 = product of:
      0.20873523 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 5045) [ClassicSimilarity], result of:
              0.023542227 = score(doc=5045,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 5045, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5045)
          0.25 = coord(1/4)
        0.11292135 = weight(_text_:term in 5045) [ClassicSimilarity], result of:
          0.11292135 = score(doc=5045,freq=8.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.5155283 = fieldWeight in 5045, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5045)
        0.08992833 = weight(_text_:frequency in 5045) [ClassicSimilarity], result of:
          0.08992833 = score(doc=5045,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 5045, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5045)
      0.75 = coord(3/4)
    
    Abstract
    Topic models often produce unexplainable topics that are filled with noisy words. The reason is that words in topic modeling have equal weights. High-frequency words dominate the top topic word lists, but most of them are meaningless words, e.g., domain-specific stopwords. To address this issue, in this paper we investigate how to weight words, and then develop a straightforward but effective term weighting scheme, namely entropy weighting (EW). The proposed EW scheme is based on conditional entropy measured by word co-occurrences. Compared with existing term weighting schemes, the highlight of EW is that it can automatically reward informative words. For more robust word weights, we further suggest a combined form of EW (CEW) with two existing weighting schemes. Basically, our CEW assigns meaningless words lower weights and informative words higher weights, leading to more coherent topics during topic modeling inference. We apply CEW to Dirichlet multinomial mixture and latent Dirichlet allocation, and evaluate it by topic quality, document clustering and classification tasks on 8 real-world data sets. Experimental results show that weighting words can effectively improve topic modeling performance over both short texts and normal long texts. More importantly, the proposed CEW significantly outperforms the existing term weighting schemes, since it further considers which words are informative.
  12. Chen, T.T.: ¬The congruity between linkage-based factors and content-based clusters : an experimental study using multiple document corpora (2016) 0.15
    0.15236904 = product of:
      0.20315872 = sum of:
        0.019520182 = product of:
          0.07808073 = sum of:
            0.07808073 = weight(_text_:based in 2775) [ClassicSimilarity], result of:
              0.07808073 = score(doc=2775,freq=22.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.5520388 = fieldWeight in 2775, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2775)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 2775) [ClassicSimilarity], result of:
          0.056460675 = score(doc=2775,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 2775, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2775)
        0.12717786 = weight(_text_:frequency in 2775) [ClassicSimilarity], result of:
          0.12717786 = score(doc=2775,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 2775, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2775)
      0.75 = coord(3/4)
    
    Abstract
    Intellectual Structure (IS) is a bibliometric method that is widely applied in knowledge domain analysis and in science mapping. An intellectual structure consists of clusters of related documents ascribed to individual factors. Documents ascribed to a factor are generally associated with a common research theme. As such, the contents of documents ascribed to a factor are theorized to be similar to each other. This study shows that link-based relatedness implies content-based similarity. The intellectual structures of two research domains were derived from data sets retrieved from the Microsoft Academic Search database. The collection of documents ascribed to a factor is referred to as a factor-based document cluster, with which the content-based document clusters are compared. All documents in an intellectual structure are re-clustered based on their content similarity, which is derived from the cosine of their vector representations encoded with the documents' term frequency-inverse document frequency (TF-IDF) weighted terms. The factor-based document clusters are then compared with the content-based clusters for congruity. We used the Rand index and kappa coefficient to check the congruity between the factor-based and content-based document clusters. The kappa coefficient indicates that there is fair to moderate agreement between the clusters derived from these two different bases.
  13. Dang, E.K.F.; Luk, R.W.P.; Allan, J.; Ho, K.S.; Chung, K.F.L.; Lee, D.L.: ¬A new context-dependent term weight computed by boost and discount using relevance information (2010) 0.14
    0.1434364 = product of:
      0.2868728 = sum of:
        0.15969492 = weight(_text_:term in 4120) [ClassicSimilarity], result of:
          0.15969492 = score(doc=4120,freq=16.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.7290672 = fieldWeight in 4120, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4120)
        0.12717786 = weight(_text_:frequency in 4120) [ClassicSimilarity], result of:
          0.12717786 = score(doc=4120,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.46005818 = fieldWeight in 4120, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4120)
      0.5 = coord(2/4)
    
    Abstract
    We studied the effectiveness of a new class of context-dependent term weights for information retrieval. Unlike the traditional term frequency-inverse document frequency (TF-IDF), the new weighting of a term t in a document d depends not only on the occurrence statistics of t alone but also on the terms found within a text window (or "document-context") centered on t. We introduce a Boost and Discount (B&D) procedure which utilizes partial relevance information to compute the context-dependent term weights of query terms according to a logistic regression model. We investigate the effectiveness of the new term weights compared with the context-independent BM25 weights in the setting of relevance feedback. We performed experiments with title queries of the TREC-6, -7, -8, and 2005 collections, comparing the residual Mean Average Precision (MAP) measures obtained using B&D term weights and those obtained by a baseline using BM25 weights. Given either 10 or 20 relevance judgments of the top retrieved documents, using the new term weights yields improvement over the baseline for all collections tested. The MAP obtained with the new weights has relative improvement over the baseline by 3.3 to 15.2%, with statistical significance at the 95% confidence level across all four collections.
  14. Mayr, P.; Schaer, P.; Mutschke, P.: ¬A science model driven retrieval prototype (2011) 0.14
    0.1392412 = product of:
      0.18565494 = sum of:
        0.009988121 = product of:
          0.039952483 = sum of:
            0.039952483 = weight(_text_:based in 649) [ClassicSimilarity], result of:
              0.039952483 = score(doc=649,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28246817 = fieldWeight in 649, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=649)
          0.25 = coord(1/4)
        0.06775281 = weight(_text_:term in 649) [ClassicSimilarity], result of:
          0.06775281 = score(doc=649,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 649, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=649)
        0.107914 = weight(_text_:frequency in 649) [ClassicSimilarity], result of:
          0.107914 = score(doc=649,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.39037234 = fieldWeight in 649, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=649)
      0.75 = coord(3/4)
    
    Abstract
    This paper is about a better understanding of the structure and dynamics of science and the use of these insights to compensate for the typical problems that arise in metadata-driven Digital Libraries. Three science-model-driven retrieval services are presented: query expansion based on co-word analysis, re-ranking via Bradfordizing, and author centrality. The services are evaluated with relevance assessments, from which two important implications emerge: (1) precision values of the retrieval services are the same as or better than the tf-idf retrieval baseline, and (2) each service retrieved a disjoint set of documents. Each service favors quite different, but still relevant, documents than pure term-frequency-based rankings do. The proposed models and derived retrieval services therefore open up new viewpoints on the scientific knowledge space and provide an alternative framework to structure scholarly information systems.
  15. Joo, S.; Choi, I.; Choi, N.: Topic analysis of the research domain in knowledge organization : a Latent Dirichlet Allocation approach (2018) 0.14
    0.13704711 = product of:
      0.18272948 = sum of:
        0.0070626684 = product of:
          0.028250674 = sum of:
            0.028250674 = weight(_text_:based in 4304) [ClassicSimilarity], result of:
              0.028250674 = score(doc=4304,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19973516 = fieldWeight in 4304, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4304)
          0.25 = coord(1/4)
        0.06775281 = weight(_text_:term in 4304) [ClassicSimilarity], result of:
          0.06775281 = score(doc=4304,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.309317 = fieldWeight in 4304, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=4304)
        0.107914 = weight(_text_:frequency in 4304) [ClassicSimilarity], result of:
          0.107914 = score(doc=4304,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.39037234 = fieldWeight in 4304, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=4304)
      0.75 = coord(3/4)
    
    Abstract
    Based on text mining, this study explored topics in the research domain of knowledge organization. A text corpus consisting of titles and abstracts was generated from 282 articles of the Knowledge Organization journal for the ten-year period from 2006 to 2015. Term frequency analysis and Latent Dirichlet allocation topic modeling were employed to analyze the collected corpus. Topic modeling uncovered twenty research topics prevailing in the knowledge organization field, including theories and epistemology, classification scheme, domain analysis and ontology, digital archiving, document indexing and retrieval, taxonomy and thesaurus system, metadata and controlled vocabulary, ethical issues, and others. In addition, topic trends over the ten years were examined to identify topics that attracted more discussion in the journal. The top two topics that received increased attention recently were "ethical issues in knowledge organization" and "domain analysis and ontologies." This study yields insight into a better understanding of the research domain of knowledge organization. Moreover, the text mining approaches introduced in this study have methodological implications for domain analysis in knowledge organization.
  16. Chew, S.W.; Khoo, K.S.G.: Comparison of drug information on consumer drug review sites versus authoritative health information websites (2016) 0.13
    0.12989628 = product of:
      0.17319503 = sum of:
        0.005885557 = product of:
          0.023542227 = sum of:
            0.023542227 = weight(_text_:based in 2643) [ClassicSimilarity], result of:
              0.023542227 = score(doc=2643,freq=2.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.16644597 = fieldWeight in 2643, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2643)
          0.25 = coord(1/4)
        0.056460675 = weight(_text_:term in 2643) [ClassicSimilarity], result of:
          0.056460675 = score(doc=2643,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 2643, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2643)
        0.11084881 = sum of:
          0.079047784 = weight(_text_:assessment in 2643) [ClassicSimilarity], result of:
            0.079047784 = score(doc=2643,freq=2.0), product of:
              0.25917634 = queryWeight, product of:
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.04694356 = queryNorm
              0.30499613 = fieldWeight in 2643, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.52102 = idf(docFreq=480, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2643)
          0.031801023 = weight(_text_:22 in 2643) [ClassicSimilarity], result of:
            0.031801023 = score(doc=2643,freq=2.0), product of:
              0.16438834 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.04694356 = queryNorm
              0.19345059 = fieldWeight in 2643, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2643)
      0.75 = coord(3/4)
    
    Abstract
    Large amounts of health-related information of different types are available on the web. In addition to authoritative health information sites maintained by government health departments and healthcare institutions, there are many social media sites carrying user-contributed information. This study sought to identify the types of drug information available on consumer-contributed drug review sites when compared with authoritative drug information websites. Content analysis was performed on the information available for nine drugs on three authoritative sites (RxList, eMC, and PDRhealth) as well as three drug review sites (WebMD, RateADrug, and PatientsLikeMe). The types of information found on authoritative sites but rarely on drug review sites include pharmacology, special population considerations, contraindications, and drug interactions. Types of information found only on drug review sites include drug efficacy, drug resistance experienced by long-term users, cost of drug in relation to insurance coverage, availability of generic forms, comparison with other similar drugs and with other versions of the drug, difficulty in using the drug, and advice on coping with side effects. Drug efficacy ratings by users were found to be different across the three sites. Side effects were vividly described in context, with user assessment of severity based on discomfort and effect on their lives.
    Date
    22. 1.2016 12:24:05
  17. Kauchak, D.; Leroy, G.; Hogue, A.: Measuring text difficulty using parse-tree frequency (2017) 0.13
    0.12985206 = product of:
      0.2597041 = sum of:
        0.07984746 = weight(_text_:term in 3786) [ClassicSimilarity], result of:
          0.07984746 = score(doc=3786,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.3645336 = fieldWeight in 3786, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3786)
        0.17985666 = weight(_text_:frequency in 3786) [ClassicSimilarity], result of:
          0.17985666 = score(doc=3786,freq=8.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.6506205 = fieldWeight in 3786, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3786)
      0.5 = coord(2/4)
    
    Abstract
    Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
  18. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.12
    0.1242152 = product of:
      0.2484304 = sum of:
        0.09581695 = weight(_text_:term in 4119) [ClassicSimilarity], result of:
          0.09581695 = score(doc=4119,freq=4.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.4374403 = fieldWeight in 4119, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.046875 = fieldNorm(doc=4119)
        0.15261345 = weight(_text_:frequency in 4119) [ClassicSimilarity], result of:
          0.15261345 = score(doc=4119,freq=4.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.55206984 = fieldWeight in 4119, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=4119)
      0.5 = coord(2/4)
    
    Abstract
    In this work, we investigate the problem of using the block structure of Web pages to improve ranking results. Starting with basic intuitions provided by the concepts of term frequency (TF) and inverse document frequency (IDF), we propose nine block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside whole pages. These are then used to compute a modified BM25 ranking function. Using four distinct Web collections, we ran extensive experiments to compare our block-weight ranking formulas with two baselines: (a) a BM25 ranking applied to full pages, and (b) a BM25 ranking that takes into account the best blocks. Our results suggest that our block-weight ranking method is superior to all baselines across all collections we used, with average gains in precision of 5 to 20%.
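
For reference, a minimal sketch of a standard BM25 term score combined with one invented block-weight scheme (a weighted sum of per-block term frequencies, so occurrences in important blocks count for more). The nine block-weight functions of the paper are not reproduced, and all parameter values below are assumptions.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 contribution of a single query term to a document's score."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

def blocked_tf(per_block_freqs, block_weights):
    """Illustrative block weighting: weighted sum of per-block term frequencies."""
    return sum(w * f for w, f in zip(block_weights, per_block_freqs))

# Term occurs 3x in a title block and 1x in a footer block (weights invented).
tf = blocked_tf([3, 1], block_weights=[2.0, 0.2])
print(round(bm25_term_score(tf, doc_len=120, avg_doc_len=300,
                            n_docs=44218, doc_freq=332), 3))
```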
  19. Ni, C.; Shaw, D.; Lind, S.M.; Ding, Y.: Journal impact and proximity : an assessment using bibliographic features (2013) 0.12
    0.1239981 = product of:
      0.1653308 = sum of:
        0.009988121 = product of:
          0.039952483 = sum of:
            0.039952483 = weight(_text_:based in 686) [ClassicSimilarity], result of:
              0.039952483 = score(doc=686,freq=4.0), product of:
                0.14144066 = queryWeight, product of:
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.04694356 = queryNorm
                0.28246817 = fieldWeight in 686, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.0129938 = idf(docFreq=5906, maxDocs=44218)
                  0.046875 = fieldNorm(doc=686)
          0.25 = coord(1/4)
        0.107914 = weight(_text_:frequency in 686) [ClassicSimilarity], result of:
          0.107914 = score(doc=686,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.39037234 = fieldWeight in 686, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.046875 = fieldNorm(doc=686)
        0.047428668 = product of:
          0.094857335 = sum of:
            0.094857335 = weight(_text_:assessment in 686) [ClassicSimilarity], result of:
              0.094857335 = score(doc=686,freq=2.0), product of:
                0.25917634 = queryWeight, product of:
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.04694356 = queryNorm
                0.36599535 = fieldWeight in 686, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.52102 = idf(docFreq=480, maxDocs=44218)
                  0.046875 = fieldNorm(doc=686)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    Journals in the Information Science & Library Science category of Journal Citation Reports (JCR) were compared using both bibliometric and bibliographic features. Data collected covered journal impact factor (JIF), number of issues per year, number of authors per article, longevity, editorial board membership, frequency of publication, number of databases indexing the journal, number of aggregators providing full-text access, country of publication, JCR categories, Dewey decimal classification, and journal statement of scope. Three features significantly correlated with JIF: number of editorial board members and number of JCR categories in which a journal is listed correlated positively; journal longevity correlated negatively with JIF. Coword analysis of journal descriptions provided a proximity clustering of journals, which differed considerably from the clusters based on editorial board membership. Finally, a multiple linear regression model was built to predict the JIF based on all the collected bibliographic features.
  20. Brandão, W.C.; Santos, R.L.T.; Ziviani, N.; Moura, E.S. de; Silva, A.S. da: Learning to expand queries using entities (2014) 0.12
    0.12171714 = product of:
      0.16228952 = sum of:
        0.056460675 = weight(_text_:term in 1343) [ClassicSimilarity], result of:
          0.056460675 = score(doc=1343,freq=2.0), product of:
            0.21904005 = queryWeight, product of:
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.04694356 = queryNorm
            0.25776416 = fieldWeight in 1343, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.66603 = idf(docFreq=1130, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1343)
        0.08992833 = weight(_text_:frequency in 1343) [ClassicSimilarity], result of:
          0.08992833 = score(doc=1343,freq=2.0), product of:
            0.27643865 = queryWeight, product of:
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.04694356 = queryNorm
            0.32531026 = fieldWeight in 1343, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.888745 = idf(docFreq=332, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1343)
        0.015900511 = product of:
          0.031801023 = sum of:
            0.031801023 = weight(_text_:22 in 1343) [ClassicSimilarity], result of:
              0.031801023 = score(doc=1343,freq=2.0), product of:
                0.16438834 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04694356 = queryNorm
                0.19345059 = fieldWeight in 1343, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1343)
          0.5 = coord(1/2)
      0.75 = coord(3/4)
    
    Abstract
    A substantial fraction of web search queries contain references to entities, such as persons, organizations, and locations. Recently, methods that exploit named entities have been shown to be more effective for query expansion than traditional pseudo-relevance feedback methods. In this article, we introduce a supervised learning approach that exploits named entities for query expansion, using Wikipedia as a repository of high-quality feedback documents. In contrast with existing entity-oriented pseudo-relevance feedback approaches, we tackle query expansion as a learning-to-rank problem. As a result, not only do we select effective expansion terms, but we also weigh these terms according to their predicted effectiveness. To this end, we exploit the rich structure of Wikipedia articles to devise discriminative term features, including each candidate term's proximity to the original query terms, as well as its frequency across multiple article fields and in category and infobox descriptors. Experiments on three Text REtrieval Conference web test collections attest to the effectiveness of our approach, with gains of up to 23.32% in terms of mean average precision, 19.49% in terms of precision at 10, and 7.86% in terms of normalized discounted cumulative gain, compared with a state-of-the-art approach for entity-oriented query expansion.
    Date
    22. 8.2014 17:07:50

Languages

  • e 2041
  • d 202
  • f 2
  • i 2
  • a 1
  • hu 1
  • m 1
  • pt 1
  • sp 1

Types

  • a 2054
  • el 185
  • m 107
  • s 44
  • x 23
  • r 10
  • b 5
  • i 1
  • p 1
  • z 1
