Search (3079 results, page 1 of 154)

  • Filter: type_ss:"a"
  1. Chau, M.; Lu, Y.; Fang, X.; Yang, C.C.: Characteristics of character usage in Chinese Web searching (2009) 0.13
    0.13408904 = product of:
      0.33522257 = sum of:
        0.30227408 = weight(_text_:grams in 2456) [ClassicSimilarity], result of:
          0.30227408 = score(doc=2456,freq=6.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.77113974 = fieldWeight in 2456, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2456)
        0.032948487 = weight(_text_:22 in 2456) [ClassicSimilarity], result of:
          0.032948487 = score(doc=2456,freq=2.0), product of:
            0.17031991 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04863741 = queryNorm
            0.19345059 = fieldWeight in 2456, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2456)
      0.4 = coord(2/5)
    
    Abstract
     The use of non-English Web search engines is prevalent. Given the popularity of Chinese Web searching and the unique characteristics of the Chinese language, it is important to analyze Chinese Web search queries. In this paper, we report our research on character usage in Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that character usage in search queries differed considerably from character usage in general online Chinese text. After studying the Zipf distribution of n-grams for different values of n, we found that the unigram curve deviates most from the Zipf distribution while the bigram curve follows it best, and that the curves for larger n (n = 3-6) have similar shapes, with exponent values in the range of 0.66-0.86. The distribution of combined n-grams was also studied. All analyses were performed on the data both before and after the removal of function terms and incomplete terms, with similar findings in both cases. We believe the findings from this study provide insights for further research on non-English Web searching and will assist in the design of more effective Chinese Web search engines.
    Date
    22.11.2008 17:57:22
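     The indented blocks beneath each hit are Lucene/Solr "explain" trees for the ClassicSimilarity (TF-IDF) ranking function. As a reading aid, the sketch below recomputes the score of hit 1 from the factors shown in its tree; the numbers are copied from the tree, while the helper names and the arithmetic layout are ours.

```python
import math

# Factors copied from the explain tree of hit 1 (doc 2456).
QUERY_NORM = 0.04863741

def field_weight(freq: float, idf: float, field_norm: float) -> float:
    """tf * idf * fieldNorm, with tf = sqrt(freq) in ClassicSimilarity."""
    return math.sqrt(freq) * idf * field_norm

def term_score(freq: float, idf: float, field_norm: float) -> float:
    """queryWeight * fieldWeight, where queryWeight = idf * queryNorm."""
    return (idf * QUERY_NORM) * field_weight(freq, idf, field_norm)

grams = term_score(freq=6.0, idf=8.059301, field_norm=0.0390625)     # ~0.30227408
term_22 = term_score(freq=2.0, idf=3.5018296, field_norm=0.0390625)  # ~0.032948487

# coord(2/5): two of five query terms matched, so the summed score is scaled by 0.4.
print(round((grams + term_22) * 2 / 5, 8))  # ~0.13408904, the displayed document score
```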
  2. Robertson, A.M.; Willett, P.: Applications of n-grams in textual information systems (1998) 0.11
    0.111691535 = product of:
      0.5584577 = sum of:
        0.5584577 = weight(_text_:grams in 4715) [ClassicSimilarity], result of:
          0.5584577 = score(doc=4715,freq=8.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            1.4246967 = fieldWeight in 4715, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0625 = fieldNorm(doc=4715)
      0.2 = coord(1/5)
    
    Abstract
     Provides an introduction to the use of n-grams in textual information systems, where an n-gram is a string of n, usually adjacent, characters extracted from a section of continuous text. Applications that can be implemented efficiently and effectively using sets of n-grams include spelling error detection and correction, query expansion, information retrieval with serial, inverted and signature files, dictionary look-up, text compression, and language identification.
    Object
    n-grams
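     As an illustration of one of the applications listed above (spelling error detection and correction), the sketch below compares word forms by the overlap of their character n-gram sets using the Dice coefficient. The padding scheme, the toy dictionary, and the choice of bigrams are illustrative assumptions, not details from Robertson and Willett.

```python
def ngrams(word: str, n: int = 2) -> set[str]:
    """Character n-grams of a word, padded so that word boundaries contribute too."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a: set[str], b: set[str]) -> float:
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def suggest(misspelled: str, dictionary: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Rank dictionary words by bigram overlap with the misspelled form."""
    grams = ngrams(misspelled)
    scored = [(w, dice(grams, ngrams(w))) for w in dictionary]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

words = ["retrieval", "retrieve", "reversal", "interval", "revival"]  # toy dictionary
print(suggest("retreival", words))  # "retrieval" ranks first despite the transposition
```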
  3. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.11
    0.10851419 = product of:
      0.27128547 = sum of:
        0.23174728 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.23174728 = score(doc=562,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.039538182 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.039538182 = score(doc=562,freq=2.0), product of:
            0.17031991 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04863741 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.4 = coord(2/5)
    
    Content
     Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
  4. Huffman, S.: Acquaintance : language-independent document categorization by n-grams (1996) 0.10
    0.09773009 = product of:
      0.48865044 = sum of:
        0.48865044 = weight(_text_:grams in 7530) [ClassicSimilarity], result of:
          0.48865044 = score(doc=7530,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            1.2466096 = fieldWeight in 7530, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.109375 = fieldNorm(doc=7530)
      0.2 = coord(1/5)
    
  5. Figuerola, C.G.; Gomez, R.; Lopez de San Roman, E.: Stemming and n-grams in Spanish : an evaluation of their impact in information retrieval (2000) 0.08
    0.08376864 = product of:
      0.4188432 = sum of:
        0.4188432 = weight(_text_:grams in 6501) [ClassicSimilarity], result of:
          0.4188432 = score(doc=6501,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            1.0685225 = fieldWeight in 6501, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.09375 = fieldNorm(doc=6501)
      0.2 = coord(1/5)
    
  6. Stamatatos, E.: Plagiarism detection using stopword n-grams (2011) 0.07
    0.07254578 = product of:
      0.3627289 = sum of:
        0.3627289 = weight(_text_:grams in 4955) [ClassicSimilarity], result of:
          0.3627289 = score(doc=4955,freq=6.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.92536765 = fieldWeight in 4955, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=4955)
      0.2 = coord(1/5)
    
    Abstract
    In this paper a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries. Experimental results on a publicly available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified and most of the words or phrases have been replaced with synonyms.
    Object
    n-grams
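     A minimal sketch of the core idea above: reduce each passage to its sequence of stopwords and compare stopword n-grams, which survive synonym substitution because content words are ignored. The tiny stopword list, n = 3, and the containment measure are simplifications of ours, not Stamatatos's exact parameters.

```python
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it", "for"}  # tiny illustrative list

def stopword_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Keep only stopwords, then form overlapping n-grams over that sequence."""
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def containment(suspicious: str, source: str, n: int = 3) -> float:
    """Fraction of the suspicious passage's stopword n-grams that also occur in the source."""
    s, d = stopword_ngrams(suspicious, n), stopword_ngrams(source, n)
    return len(s & d) / len(s) if s else 0.0

src = "the structure of the argument is that the evidence is weak in a number of places"
sus = "the structure of the reasoning is that the proof is weak in a number of spots"
print(containment(sus, src))  # stays high even though content words were replaced
```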
  7. Khoo, C.S.G.; Dai, D.; Loh, T.E.: Using statistical and contextual information to identify two- and three-character words in Chinese text (2002) 0.07
    0.06980721 = product of:
      0.34903604 = sum of:
        0.34903604 = weight(_text_:grams in 5206) [ClassicSimilarity], result of:
          0.34903604 = score(doc=5206,freq=8.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.89043546 = fieldWeight in 5206, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5206)
      0.2 = coord(1/5)
    
    Abstract
     Khoo, Dai, and Loh examine new statistical methods for the identification of two- and three-character words in Chinese text. Some meaningful Chinese words are simple (independent units of one or more characters in a sentence that have independent meaning), but others are compounds of two or more simple words. For manual segmentation they used the Modern Chinese Word Segmentation for Application of Information Processing standard, with some modifications to focus on meaningful words. About 37% of meaningful words are longer than two characters, indicating a need to handle three- and four-character words. Four hundred sentences from news articles were manually broken into overlapping bi-grams and tri-grams. Using logistic regression, the log odds that such bi-/tri-grams were meaningful words were calculated. Variables such as relative frequency, document frequency, local frequency, and contextual and positional information were incorporated in the model only if the concordance measure improved by at least 2% with their addition. For two- and three-character words, the relative frequency of adjacent characters and the document frequency of overlapping bi-grams were found to be significant. Using measures of recall and precision in which correct automatic segmentation is normalized either by manual segmentation or by automatic segmentation, the contextual information formula for two-character words provides significantly better results than previous formulations, and using the two- and three-character formulations in combination significantly improves the two-character results.
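     The candidate-generation step described above (breaking sentences into overlapping bi-grams and tri-grams before scoring them with logistic regression) can be sketched as follows; the example sentence and the purely character-level treatment are illustrative assumptions.

```python
def overlapping_ngrams(chars: str, n: int) -> list[str]:
    """All overlapping character n-grams of a punctuation-free string."""
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

sentence = "我们研究中文分词方法"  # illustrative sentence, not from the paper's corpus
bigram_candidates = overlapping_ngrams(sentence, 2)   # candidate two-character words
trigram_candidates = overlapping_ngrams(sentence, 3)  # candidate three-character words
print(bigram_candidates)
print(trigram_candidates)
# In Khoo, Dai, and Loh's setup each candidate would then receive features such as
# relative frequency, document frequency, and positional information, and a logistic
# regression model estimates the log odds that it is a meaningful word.
```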
  8. Chen, L.; Fang, H.: An automatic method for extracting innovative ideas based on the Scopus® database (2019) 0.07
    0.06980721 = product of:
      0.34903604 = sum of:
        0.34903604 = weight(_text_:grams in 5310) [ClassicSimilarity], result of:
          0.34903604 = score(doc=5310,freq=8.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.89043546 = fieldWeight in 5310, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5310)
      0.2 = coord(1/5)
    
    Abstract
     The novelty of the knowledge claims in a research paper can be considered an evaluation criterion that supplements citations. To provide a foundation for research evaluation from the perspective of innovativeness, we propose an automatic approach for extracting innovative ideas from the abstracts of technology and engineering papers. The approach extracts N-grams as candidates based on part-of-speech tagging and determines whether they are novel by checking the Scopus® database to see whether they have been presented previously. Moreover, we discuss the distribution of innovative ideas across different abstract structures. To improve performance by excluding noisy N-grams, a list of stopwords and a list of research description characteristics were developed. We selected abstracts of articles published from 2011 to 2017 on the topic of semantic analysis as the experimental texts. Excluding noisy N-grams, considering the distribution of innovative ideas in abstracts, and suitably combining N-grams can effectively improve the performance of automatic innovative-idea extraction. Unlike co-word and co-citation analysis, innovative-idea extraction aims to identify how a paper differs from all previously published papers.
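     A hedged sketch of the pipeline outlined above: generate word n-gram candidates from an abstract, drop candidates containing stopwords, and flag as novel those not found in a reference phrase collection. A local Python set stands in for the Scopus® novelty check, and the stopword filter replaces the paper's part-of-speech tagging.

```python
import re

STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "for", "with", "on", "we"}

def candidate_ngrams(text: str, n_values=(2, 3)) -> set[str]:
    """Word n-grams containing no stopword (a crude stand-in for POS-based filtering)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = set()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(t in STOPWORDS for t in gram):
                candidates.add(" ".join(gram))
    return candidates

# Stand-in for "has this phrase appeared in earlier literature?" (Scopus® in the paper).
known_phrases = {"semantic analysis", "text classification", "information retrieval"}

abstract = "We propose cross-lingual semantic analysis with a graph attention encoder."
print(sorted(candidate_ngrams(abstract) - known_phrases))  # phrases flagged as potentially novel
```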
  9. Mustafa, S.H.; Al-Radaideh, Q.A.: Using n-grams for Arabic text searching (2004) 0.07
    0.06910561 = product of:
      0.34552804 = sum of:
        0.34552804 = weight(_text_:grams in 2888) [ClassicSimilarity], result of:
          0.34552804 = score(doc=2888,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.88148606 = fieldWeight in 2888, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2888)
      0.2 = coord(1/5)
    
    Abstract
     N-grams have been widely investigated for a number of text processing and retrieval applications. This article examines the performance of the digram and trigram term conflation techniques in the context of Arabic free-text retrieval. It reports the results of using the N-gram approach for a corpus of thousands of distinct textual words drawn from a number of sources representing various disciplines. The results indicate that the digram method offers better performance than the trigram method with respect to conflation precision and conflation recall ratios. In either case, the N-gram approach does not appear to provide efficient conflation, owing to the peculiarities imposed by the Arabic infix structure, which reduce the rate of correct N-gram matching.
  10. Schrodt, R.: Tiefen und Untiefen im wissenschaftlichen Sprachgebrauch (2008) 0.06
    0.061799277 = product of:
      0.30899638 = sum of:
        0.30899638 = weight(_text_:3a in 140) [ClassicSimilarity], result of:
          0.30899638 = score(doc=140,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.7493574 = fieldWeight in 140, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.0625 = fieldNorm(doc=140)
      0.2 = coord(1/5)
    
    Content
     Cf. also: https://studylibde.com/doc/13053640/richard-schrodt. Cf. also: http://www.univie.ac.at/Germanistik/schrodt/vorlesung/wissenschaftssprache.doc.
  11. Popper, K.R.: Three worlds : the Tanner lecture on human values. Delivered at the University of Michigan, April 7, 1978 (1978) 0.06
    0.061799277 = product of:
      0.30899638 = sum of:
        0.30899638 = weight(_text_:3a in 230) [ClassicSimilarity], result of:
          0.30899638 = score(doc=230,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.7493574 = fieldWeight in 230, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.0625 = fieldNorm(doc=230)
      0.2 = coord(1/5)
    
    Source
     https://tannerlectures.utah.edu/_documents/a-to-z/p/popper80.pdf
  12. Egghe, L.: Properties of the n-overlap vector and n-overlap similarity theory (2006) 0.06
    0.060454816 = product of:
      0.30227408 = sum of:
        0.30227408 = weight(_text_:grams in 194) [ClassicSimilarity], result of:
          0.30227408 = score(doc=194,freq=6.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.77113974 = fieldWeight in 194, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=194)
      0.2 = coord(1/5)
    
    Abstract
     In the first part of this article the author defines the n-overlap vector, whose coordinates consist of the fraction of the objects (e.g., books, N-grams, etc.) that belong to 1, 2, ..., n sets (more generally: families) (e.g., libraries, databases, etc.). With the aid of the Lorenz concentration theory, a theory of n-overlap similarity is conceived together with corresponding measures, such as the generalized Jaccard index (generalizing the well-known Jaccard index for the case n = 2). Next, the distributional form of the n-overlap vector is determined, assuming certain distributions of the object and set (family) sizes. In this section the decreasing power law and the decreasing exponential distribution are explained for the n-overlap vector. Both item (token) n-overlap and source (type) n-overlap are studied. The n-overlap properties of objects indexed by a hierarchical system (e.g., books indexed by numbers from a UDC or Dewey system, or by N-grams) are presented in the final section. The author shows how the results of the previous section can be applied, as well as how the Lorenz order of the n-overlap vector is respected by an increase or a decrease of the level of refinement in the hierarchical system (e.g., the value N in N-grams).
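     The n-overlap vector defined above can be computed directly from a family of sets: count, for every object, how many sets it belongs to, then report the fraction of objects with each overlap count. The toy "libraries" are illustrative, and the vector is read here as "exactly k sets"; Egghe's paper should be consulted for the precise convention and for the similarity measures built on top of it.

```python
from collections import Counter

def n_overlap_vector(families: list[set[str]]) -> list[float]:
    """Coordinate k (1-based): fraction of distinct objects appearing in exactly k of the sets."""
    membership = Counter()
    for family in families:
        for obj in family:
            membership[obj] += 1
    counts = Counter(membership.values())
    total = len(membership)
    return [counts.get(k, 0) / total for k in range(1, len(families) + 1)]

libraries = [                      # three toy "libraries" holding book identifiers
    {"b1", "b2", "b3", "b4"},
    {"b3", "b4", "b5"},
    {"b4", "b5", "b6"},
]
print(n_overlap_vector(libraries))  # [0.5, 0.333..., 0.166...]: half the books sit in one library only
```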
  13. Carterette, B.; Can, F.: Comparing inverted files and signature files for searching a large lexicon (2005) 0.06
    0.05923338 = product of:
      0.2961669 = sum of:
        0.2961669 = weight(_text_:grams in 1029) [ClassicSimilarity], result of:
          0.2961669 = score(doc=1029,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.7555595 = fieldWeight in 1029, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=1029)
      0.2 = coord(1/5)
    
    Abstract
    Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
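     A rough sketch of the signature-file idea described above: each lexicon term receives a fixed-width bit signature in which every character n-gram sets exactly one (hashed) bit, and a partially-specified query matches a term only if all of the query's n-gram bits are set in the term's signature; candidates must still be verified against the actual term because of hash collisions. The signature width, hash function, and n are our assumptions, not the paper's tuned values.

```python
import hashlib

WIDTH = 64  # signature width in bits (illustrative; the paper tunes this against the lexicon)

def ngrams(term: str, n: int = 2) -> set[str]:
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def bit_for(gram: str) -> int:
    """Each n-gram sets exactly one bit, chosen by hashing."""
    h = int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "big")
    return 1 << (h % WIDTH)

def signature(term: str) -> int:
    sig = 0
    for gram in ngrams(term):
        sig |= bit_for(gram)
    return sig

lexicon = ["retrieval", "retrograde", "signature", "inverted", "lexicon"]
signatures = {term: signature(term) for term in lexicon}

query_sig = signature("retr")  # partially-specified query: terms containing "retr"
candidates = [t for t, s in signatures.items() if s & query_sig == query_sig]
print(candidates)  # a superset of the true matches; collisions can add false positives
```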
  14. Ahmed, F.; Nürnberger, A.: Evaluation of n-gram conflation approaches for Arabic text retrieval (2009) 0.06
    0.05923338 = product of:
      0.2961669 = sum of:
        0.2961669 = weight(_text_:grams in 2941) [ClassicSimilarity], result of:
          0.2961669 = score(doc=2941,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.7555595 = fieldWeight in 2941, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=2941)
      0.2 = coord(1/5)
    
    Abstract
    In this paper we present a language-independent approach for conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method is based on an enhancement of the pure n-gram model that can group related words based on various string-similarity measures, while restricting the search to specific locations of the target word by taking into account the order of n-grams. We show that the method is effective to achieve high score similarities for all word-form variations and reduces the ambiguity, i.e., obtains a higher precision and recall, compared to pure n-gram-based approaches for English, Portuguese, and Arabic. The proposed method is especially suited for conflation approaches in Arabic, since Arabic is a highly inflectional language. Therefore, we present in addition an adaptive user interface for Arabic text retrieval called araSearch. araSearch serves as a metasearch interface to existing search engines. The system is able to extend a query using the proposed conflation approach such that additional results for relevant subwords can be found automatically.
    Object
    n-grams
  15. Vetere, G.; Lenzerini, M.: Models for semantic interoperability in service-oriented architectures (2005) 0.05
    0.054074373 = product of:
      0.27037185 = sum of:
        0.27037185 = weight(_text_:3a in 306) [ClassicSimilarity], result of:
          0.27037185 = score(doc=306,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.65568775 = fieldWeight in 306, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.0546875 = fieldNorm(doc=306)
      0.2 = coord(1/5)
    
    Content
     Cf.: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5386707.
  16. Yang, C.C.; Li, K.W.: A heuristic method based on a statistical approach for Chinese text segmentation (2005) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 4580) [ClassicSimilarity], result of:
          0.24680576 = score(doc=4580,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 4580, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4580)
      0.2 = coord(1/5)
    
    Abstract
     The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed using statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required in this method. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Due to the lack of delimiters between words in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research studies dealing with the problem of word segmentation have focused on the resolution of segmentation ambiguities. The problem of unknown word identification has not drawn much attention. The experimental results show that the proposed heuristic method is promising for segmenting unknown words as well as known words. The authors further investigated the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method and benchmarked it against a previously proposed technique, boundary detection. It is found that the heuristic method outperformed the boundary detection method.
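     The "mutual information of bi-grams" used above can be illustrated by computing pointwise mutual information for adjacent character pairs over a small corpus: character pairs that form words tend to score higher than pairs that are merely adjacent by chance. The miniature corpus and the log base are illustrative assumptions.

```python
import math
from collections import Counter

def bigram_pmi(corpus: list[str]) -> dict[str, float]:
    """Pointwise mutual information log2(P(xy) / (P(x) * P(y))) for adjacent character pairs."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())
    return {
        pair: math.log2((count / n_pairs) / ((chars[pair[0]] / n_chars) * (chars[pair[1]] / n_chars)))
        for pair, count in pairs.items()
    }

corpus = ["中文分词很重要", "中文信息处理", "统计分词方法"]  # illustrative sentences
for pair, score in sorted(bigram_pmi(corpus).items(), key=lambda item: -item[1])[:5]:
    print(pair, round(score, 2))
# High-PMI pairs such as 分词 ("word segmentation") are candidate segmentation units.
```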
  17. Morato, J.; Llorens, J.; Genova, G.; Moreiro, J.A.: Experiments in discourse analysis impact on information classification and retrieval algorithms (2003) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 1083) [ClassicSimilarity], result of:
          0.24680576 = score(doc=1083,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 1083, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1083)
      0.2 = coord(1/5)
    
    Abstract
     Researchers in indexing and retrieval systems have been advocating the inclusion of more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to carry out text analysis by means of linguistic and extra-linguistic knowledge. Since the mid-1980s, research has tended to pay more attention to context, giving discourse analysis a more central role. The research presented in this paper aims to check whether discourse variables have an impact on modern information retrieval and classification algorithms. In order to evaluate this hypothesis, a functional framework for information analysis in an automated environment has been proposed, in which n-gram filtering and the k-means and Chen's classification algorithms were tested against sub-collections of documents based on the following discourse variables: "Genre", "Register", "Domain terminology", and "Document structure". The results obtained with the algorithms for the different sub-collections were compared to the MeSH information structure. They show that the n-gram approach does not appear to have a clear dependence on discourse variables, that the k-means classification algorithm does, but only on domain terminology and document structure, and that Chen's algorithm has a clear dependence on all of the discourse variables. This information could be used to design better classification algorithms in which discourse variables are taken into account. Other minor conclusions drawn from these results are also presented.
  18. Egghe, L.; Ravichandra Rao, I.K.: The influence of the broadness of a query of a topic on its h-index : models and examples of the h-index of n-grams (2008) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 2009) [ClassicSimilarity], result of:
          0.24680576 = score(doc=2009,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 2009, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2009)
      0.2 = coord(1/5)
    
    Abstract
     The article studies the influence of the query formulation of a topic on its h-index. In order to generate pure random sets of documents, we used N-grams (N variable) to measure this influence: strings of zeros, truncated at the end. The databases used are WoS and Scopus. The formula h = T^(1/alpha), proved in Egghe and Rousseau (2006), where T is the number of retrieved documents and alpha is Lotka's exponent, is confirmed to be a concavely increasing function of T. We also give a formula for the relation between h and N, the length of the N-gram: h = D*10^(-N/alpha), where D is a constant; this is a convexly decreasing function of N, which is found in our experiments. Nonlinear regression on h = T^(1/alpha) gives an estimation of alpha, which can then be used to estimate the h-index of the entire database (Web of Science [WoS] and Scopus): h = S^(1/alpha), where S is the total number of documents in the database.
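     For orientation, the two power-law relations quoted above can be evaluated directly: h = T^(1/alpha) relates the h-index to the number of retrieved documents, and h = D*10^(-N/alpha) relates it to the length N of the N-gram query. The parameter values below are made up purely to show the shape of the curves; they are not results from Egghe and Ravichandra Rao.

```python
def h_from_t(t: float, alpha: float) -> float:
    """Predicted h-index from the number of retrieved documents: h = T**(1/alpha)."""
    return t ** (1 / alpha)

def h_from_ngram_length(n: int, d: float, alpha: float) -> float:
    """Predicted h-index as a function of N-gram length: h = D * 10**(-N/alpha)."""
    return d * 10 ** (-n / alpha)

ALPHA = 2.0  # illustrative Lotka exponent
for t in (100, 10_000, 1_000_000):
    print(t, round(h_from_t(t, ALPHA), 1))                            # concavely increasing in T
for n in (1, 2, 3, 4):
    print(n, round(h_from_ngram_length(n, d=500.0, alpha=ALPHA), 1))  # convexly decreasing in N
```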
  19. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 1283) [ClassicSimilarity], result of:
          0.24680576 = score(doc=1283,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 1283, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
      0.2 = coord(1/5)
    
    Abstract
    While term independence is a widely held assumption in most of the established information retrieval approaches, it is clearly not true and various works in the past have investigated a relaxation of the assumption. One approach is to use n-grams in document representation instead of unigrams. However, the majority of early works on n-grams obtained only modest performance improvement. On the other hand, the use of information based on supporting terms or "contexts" of queries has been found to be promising. In particular, recent studies showed that using new context-dependent term weights improved the performance of relevance feedback (RF) retrieval compared with using traditional bag-of-words BM25 term weights. Calculation of the new term weights requires an estimation of the local probability of relevance of each query term occurrence. In previous studies, the estimation of this probability was based on unigrams that occur in the neighborhood of a query term. We explore an integration of the n-gram and context approaches by computing context-dependent term weights based on a mixture of unigrams and bigrams. Extensive experiments are performed using the title queries of the Text Retrieval Conference (TREC)-6, TREC-7, TREC-8, and TREC-2005 collections, for RF with relevance judgment of either the top 10 or top 20 documents of an initial retrieval. We identify some crucial elements needed in the use of bigrams in our methods, such as proper inverse document frequency (IDF) weighting of the bigrams and noise reduction by pruning bigrams with large document frequency values. We show that enhancing context-dependent term weights with bigrams is effective in further improving retrieval performance.
  20. Cohen, J.D.: Highlights: language- and domain-independent automatic indexing terms for abstracting (1995) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 1793) [ClassicSimilarity], result of:
          0.24432522 = score(doc=1793,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 1793, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1793)
      0.2 = coord(1/5)
    
    Abstract
     Presents a model of drawing index terms from text. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, called 'highlights', are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Presents some experimental results, showing operation in English, Spanish, German, Georgian, Russian, and Japanese.
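     As a loose sketch of the language-independent idea above, one can score each word of a document by how unusual its character n-grams are relative to the whole corpus and pick the top scorers as "highlights". The rarity score below is our own simplification, not Cohen's formula.

```python
import math
import re
from collections import Counter

def char_ngrams(word: str, n: int = 3) -> list[str]:
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]

def highlights(document: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank the document's words by the average rarity of their character n-grams in the corpus."""
    counts = Counter()
    for text in corpus:
        for word in re.findall(r"\w+", text.lower()):
            counts.update(char_ngrams(word))
    total = sum(counts.values())

    def rarity(word: str) -> float:
        grams = char_ngrams(word)
        return sum(-math.log((counts[g] + 1) / (total + 1)) for g in grams) / len(grams)

    return sorted(set(re.findall(r"\w+", document.lower())), key=rarity, reverse=True)[:k]

corpus = ["the cat sat on the mat", "the dog sat on the rug", "a cat and a dog met"]
doc = "the cat chased a peculiar chimera across the mat"
print(highlights(doc, corpus))  # words with unusual n-grams outrank common ones like "the"
```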
