Search (3680 results, page 2 of 184)

  1. #1387 0.05
    0.05218774 = product of:
      0.2609387 = sum of:
        0.2609387 = weight(_text_:22 in 1386) [ClassicSimilarity], result of:
          0.2609387 = score(doc=1386,freq=4.0), product of:
            0.17031991 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04863741 = queryNorm
            1.5320505 = fieldWeight in 1386, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.21875 = fieldNorm(doc=1386)
      0.2 = coord(1/5)
    
    Date
    22. 5.1998 20:02:22
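    The indented tree above each hit is Lucene ClassicSimilarity "explain" output (tf, idf, queryNorm, fieldNorm, coord). As a reading aid only, the following minimal Python sketch recomputes the displayed score for hit #1387 from those factors; the function and variable names are illustrative and not part of the database.

import math

def classic_similarity_score(freq, doc_freq, max_docs, query_norm,
                             field_norm, coord=1.0):
    # idf appears once in queryWeight and once in fieldWeight; the printed
    # score is coord * queryWeight * fieldWeight.
    tf = math.sqrt(freq)                               # 2.0 for freq=4.0
    idf = 1.0 + math.log(max_docs / (doc_freq + 1.0))  # 3.5018296 here
    query_weight = idf * query_norm                    # 0.17031991
    field_weight = tf * idf * field_norm               # 1.5320505
    return coord * query_weight * field_weight

score = classic_similarity_score(freq=4.0, doc_freq=3622, max_docs=44218,
                                 query_norm=0.04863741, field_norm=0.21875,
                                 coord=0.2)            # coord(1/5)
print(round(score, 8))                                 # ~0.05218774, as shown above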
  2. #2103 0.05
    0.05218774 = product of:
      0.2609387 = sum of:
        0.2609387 = weight(_text_:22 in 2102) [ClassicSimilarity], result of:
          0.2609387 = score(doc=2102,freq=4.0), product of:
            0.17031991 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.04863741 = queryNorm
            1.5320505 = fieldWeight in 2102, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.21875 = fieldNorm(doc=2102)
      0.2 = coord(1/5)
    
    Date
    22. 5.1998 20:02:22
  3. Yang, C.C.; Li, K.W.: ¬A heuristic method based on a statistical approach for Chinese text segmentation (2005) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 4580) [ClassicSimilarity], result of:
          0.24680576 = score(doc=4580,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 4580, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4580)
      0.2 = coord(1/5)
    
    Abstract
    The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required in this method. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Due to the lack of delimiters of words in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research studies dealing with the problem of word segmentation have focused on the resolution of segmentation ambiguities. The problem of unknown word identification has not drawn much attention. The experimental result shows that the proposed heuristic method is promising for segmenting the unknown words as well as the known words. The authors further investigated the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method and benchmarked it against a previously proposed technique, boundary detection. It is found that the heuristic method outperformed the boundary detection method.
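    For illustration only, a minimal Python sketch of the bigram statistic the abstract draws on: pointwise mutual information of adjacent characters, estimated from raw text with no dictionary. The single threshold rule below is a stand-in for the paper's six segmentation rules.

import math
from collections import Counter

def char_pmi(corpus: str):
    # Pointwise mutual information of adjacent character pairs.
    chars = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_chars, n_bigrams = sum(chars.values()), sum(bigrams.values())
    return {bg: math.log((c / n_bigrams) /
                         ((chars[bg[0]] / n_chars) * (chars[bg[1]] / n_chars)))
            for bg, c in bigrams.items()}

def segment(sentence: str, pmi, threshold=0.0):
    # Insert a boundary wherever adjacent characters are only weakly associated.
    out = [sentence[0]] if sentence else []
    for i in range(1, len(sentence)):
        if pmi.get(sentence[i - 1:i + 1], float("-inf")) < threshold:
            out.append(" ")
        out.append(sentence[i])
    return "".join(out)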
  4. Morato, J.; Llorens, J.; Genova, G.; Moreiro, J.A.: Experiments in discourse analysis impact on information classification and retrieval algorithms (2003) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 1083) [ClassicSimilarity], result of:
          0.24680576 = score(doc=1083,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 1083, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1083)
      0.2 = coord(1/5)
    
    Abstract
    Researchers in indexing and retrieval systems have been advocating the inclusion of more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to carry out text analysis by means of linguistic and extra-linguistic knowledge. Since the mid-80s, research has tended to pay more attention to context, giving discourse analysis a more central role. The research presented in this paper aims to check whether discourse variables have an impact on modern information retrieval and classification algorithms. In order to evaluate this hypothesis, a functional framework for information analysis in an automated environment has been proposed, where the n-grams (filtering) and the k-means and Chen's classification algorithms have been tested against sub-collections of documents based on the following discourse variables: "Genre", "Register", "Domain terminology", and "Document structure". The results obtained with the algorithms for the different sub-collections were compared to the MeSH information structure. These demonstrate that the n-gram filtering does not appear to have a clear dependence on discourse variables, whereas the k-means classification algorithm does, but only on domain terminology and document structure; finally, Chen's algorithm has a clear dependence on all of the discourse variables. This information could be used to design better classification algorithms in which discourse variables are taken into account. Other minor conclusions drawn from these results are also presented.
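    A minimal sketch, assuming scikit-learn is available, of one comparison of the kind described above: cluster documents with k-means over n-gram features, then measure how well the clusters align with a discourse variable. The toy corpus, labels, and parameters are invented for illustration and do not reproduce the authors' setup.

from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["case report of a rare syndrome", "randomized trial of a new drug",
        "review of imaging techniques", "trial protocol and drug dosage"]
genres = ["report", "trial", "review", "trial"]               # hypothetical labels

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)   # word uni- and bigrams
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Purity: how strongly each cluster is dominated by a single genre.
purity = sum(max(Counter(g for g, c in zip(genres, clusters) if c == k).values())
             for k in set(clusters)) / len(docs)
print(list(clusters), purity)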
  5. Egghe, L.; Ravichandra Rao, I.K.: ¬The influence of the broadness of a query of a topic on its h-index : models and examples of the h-index of n-grams (2008) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 2009) [ClassicSimilarity], result of:
          0.24680576 = score(doc=2009,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 2009, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2009)
      0.2 = coord(1/5)
    
    Abstract
    The article studies the influence of the query formulation of a topic on its h-index. In order to generate pure random sets of documents, we used N-grams (N variable) to measure this influence: strings of zeros, truncated at the end. The databases used are WoS and Scopus. The formula h = T^(1/alpha), proved in Egghe and Rousseau (2006), where T is the number of retrieved documents and alpha is Lotka's exponent, is confirmed to be a concavely increasing function of T. We also give a formula for the relation between h and N, the length of the N-gram: h = D*10^(-N/alpha), where D is a constant; this is a convexly decreasing function of N, which is found in our experiments. Nonlinear regression on h = T^(1/alpha) gives an estimation of alpha, which can then be used to estimate the h-index of the entire database (Web of Science [WoS] and Scopus): h = S^(1/alpha), where S is the total number of documents in the database.
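    As a worked illustration of the relation h = T^(1/alpha), the following Python sketch estimates alpha from observed (T, h) pairs by a least-squares fit of log h against log T. The sample values and the fit through the origin are assumptions made here, not the authors' regression procedure.

import math

def estimate_alpha(pairs):
    # Fit log h = (1/alpha) * log T through the origin; alpha = den / num.
    num = sum(math.log(t) * math.log(h) for t, h in pairs)
    den = sum(math.log(t) ** 2 for t, h in pairs)
    return den / num

def predicted_h(T, alpha):
    return T ** (1.0 / alpha)

sample = [(100, 10), (1000, 32), (10000, 100)]    # invented (T, h) pairs
alpha = estimate_alpha(sample)
print(alpha, predicted_h(44218, alpha))           # extrapolate to a whole database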
  6. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.05
    0.04936115 = product of:
      0.24680576 = sum of:
        0.24680576 = weight(_text_:grams in 1283) [ClassicSimilarity], result of:
          0.24680576 = score(doc=1283,freq=4.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.62963295 = fieldWeight in 1283, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1283)
      0.2 = coord(1/5)
    
    Abstract
    While term independence is a widely held assumption in most of the established information retrieval approaches, it is clearly not true and various works in the past have investigated a relaxation of the assumption. One approach is to use n-grams in document representation instead of unigrams. However, the majority of early works on n-grams obtained only modest performance improvement. On the other hand, the use of information based on supporting terms or "contexts" of queries has been found to be promising. In particular, recent studies showed that using new context-dependent term weights improved the performance of relevance feedback (RF) retrieval compared with using traditional bag-of-words BM25 term weights. Calculation of the new term weights requires an estimation of the local probability of relevance of each query term occurrence. In previous studies, the estimation of this probability was based on unigrams that occur in the neighborhood of a query term. We explore an integration of the n-gram and context approaches by computing context-dependent term weights based on a mixture of unigrams and bigrams. Extensive experiments are performed using the title queries of the Text Retrieval Conference (TREC)-6, TREC-7, TREC-8, and TREC-2005 collections, for RF with relevance judgment of either the top 10 or top 20 documents of an initial retrieval. We identify some crucial elements needed in the use of bigrams in our methods, such as proper inverse document frequency (IDF) weighting of the bigrams and noise reduction by pruning bigrams with large document frequency values. We show that enhancing context-dependent term weights with bigrams is effective in further improving retrieval performance.
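    A minimal Python sketch of two elements named in the abstract, IDF weighting of bigrams and pruning of bigrams with large document frequency; the pruning threshold is an illustrative assumption, and none of this reproduces the authors' context-dependent term weights.

import math
from collections import Counter
from typing import List

def bigram_idf(docs: List[List[str]], max_df_ratio: float = 0.5):
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(zip(tokens, tokens[1:])))    # each bigram counted once per doc
    return {bg: math.log(n_docs / count)
            for bg, count in df.items()
            if count / n_docs <= max_df_ratio}     # prune frequent, noisy bigrams

docs = [["term", "weighting", "for", "retrieval"],
        ["context", "dependent", "term", "weighting"],
        ["bag", "of", "words", "retrieval"]]
print(bigram_idf(docs))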
  7. Cohen, J.D.: Highlights: language- and domain-independent automatic indexing terms for abstracting (1995) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 1793) [ClassicSimilarity], result of:
          0.24432522 = score(doc=1793,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 1793, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1793)
      0.2 = coord(1/5)
    
    Abstract
    Presents a model of drawing index terms from text. The approach uses no stop list, stemmer, or other language- or domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, called 'highlights', are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Presents some experimental results, showing operation in English, Spanish, German, Georgian, Russian, and Japanese.
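    For illustration, a small Python sketch of index-term extraction from raw character n-gram counts with no stop list or stemmer; the corpus-relative ratio used to rank the 'highlights' is an assumption made here, not Cohen's exact selection procedure.

from collections import Counter

def ngram_counts(text: str, n: int = 4) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def highlights(doc: str, corpus: str, n: int = 4, top_k: int = 10):
    # Rank n-grams that are unusually frequent in the document relative to the corpus.
    doc_counts, corpus_counts = ngram_counts(doc, n), ngram_counts(corpus, n)
    doc_total = sum(doc_counts.values()) or 1
    corpus_total = sum(corpus_counts.values()) or 1
    return sorted(doc_counts,
                  key=lambda g: (doc_counts[g] / doc_total) /
                                (corpus_counts.get(g, 0.5) / corpus_total),
                  reverse=True)[:top_k]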
  8. Fox, K.L.; Frieder, O.; Knepper, M.M.; Snowberg, E.J.: SENTINEL: a multiple engine information retrieval and visualization system (1999) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 3547) [ClassicSimilarity], result of:
          0.24432522 = score(doc=3547,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 3547, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3547)
      0.2 = coord(1/5)
    
    Abstract
    We describe a prototype information retrieval system, SENTINEL, under development at Harris Corporation's Information Systems Division. SENTINEL is a fusion of multiple information retrieval technologies, integrating n-grams, a vector space model, and a neural network training rule. One of the primary advantages of SENTINEL is its 3-dimensional visualization capability, which is based fully upon the mathematical representation of information within SENTINEL. The 3-dimensional visualization capability provides users with an intuitive understanding, with which relevance/query refinement techniques can be better utilized, resulting in higher retrieval precision.
  9. Wordhoard (n.d.) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 3922) [ClassicSimilarity], result of:
          0.24432522 = score(doc=3922,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 3922, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3922)
      0.2 = coord(1/5)
    
    Abstract
    WordHoard defines a multiword unit as a special type of collocate in which the component words comprise a meaningful phrase. For example, "Knight of the Round Table" is a meaningful multiword unit or phrase. WordHoard uses the notion of a pseudo-bigram to generalize the computation of bigram (two word) statistical measures to phrases (n-grams) longer than two words, and to allow comparisons of these measures for phrases with different word counts. WordHoard applies the localmaxs algorithm of Silva et al. to the pseudo-bigrams to identify potential compositional phrases that "stand out" in a text. WordHoard can also filter two and three word phrases using the word class filters suggested by Justeson and Katz.
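    For illustration, a small Python sketch of the pseudo-bigram idea: an n-gram longer than two words is scored by treating each split into a left part and a right part as a bigram and keeping the strongest association. The Dice coefficient below is a stand-in, not necessarily the measure WordHoard applies, and the localmaxs filtering step is omitted.

from collections import Counter
from typing import List, Tuple

def phrase_counts(tokens: List[str], max_n: int = 5) -> Counter:
    # Frequencies of all word sequences up to length max_n.
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def pseudo_bigram_dice(phrase: Tuple[str, ...], counts: Counter) -> float:
    # Best Dice score over all two-part splits of the phrase.
    best = 0.0
    for k in range(1, len(phrase)):
        left, right = phrase[:k], phrase[k:]
        denom = counts[left] + counts[right]
        if denom:
            best = max(best, 2.0 * counts[phrase] / denom)
    return best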
  10. WordHoard: finding multiword units (20??) 0.05
    0.048865046 = product of:
      0.24432522 = sum of:
        0.24432522 = weight(_text_:grams in 1123) [ClassicSimilarity], result of:
          0.24432522 = score(doc=1123,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.6233048 = fieldWeight in 1123, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1123)
      0.2 = coord(1/5)
    
    Abstract
    WordHoard defines a multiword unit as a special type of collocate in which the component words comprise a meaningful phrase. For example, "Knight of the Round Table" is a meaningful multiword unit or phrase. WordHoard uses the notion of a pseudo-bigram to generalize the computation of bigram (two word) statistical measures to phrases (n-grams) longer than two words, and to allow comparisons of these measures for phrases with different word counts. WordHoard applies the localmaxs algorithm of Silva et al. to the pseudo-bigrams to identify potential compositional phrases that "stand out" in a text. WordHoard can also filter two and three word phrases using the word class filters suggested by Justeson and Katz.
  11. Mas, S.; Marleau, Y.: Proposition of a faceted classification model to support corporate information organization and digital records management (2009) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 2918) [ClassicSimilarity], result of:
          0.23174728 = score(doc=2918,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 2918, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=2918)
      0.2 = coord(1/5)
    
    Footnote
    Cf.: http://ieeexplore.ieee.org/Xplore/login.jsp?reload=true&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F4755313%2F4755314%2F04755480.pdf%3Farnumber%3D4755480&authDecision=-203.
  12. Li, L.; Shang, Y.; Zhang, W.: Improvement of HITS-based algorithms on Web documents 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 2514) [ClassicSimilarity], result of:
          0.23174728 = score(doc=2514,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 2514, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=2514)
      0.2 = coord(1/5)
    
    Content
    Cf.: http://delab.csd.auth.gr/~dimitris/courses/ir_spring06/page_rank_computing/p527-li.pdf. See also: http://www2002.org/CDROM/refereed/643/.
  13. Zeng, Q.; Yu, M.; Yu, W.; Xiong, J.; Shi, Y.; Jiang, M.: Faceted hierarchy : a new graph type to organize scientific concepts and a construction method (2019) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 400) [ClassicSimilarity], result of:
          0.23174728 = score(doc=400,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 400, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=400)
      0.2 = coord(1/5)
    
    Content
    Cf.: https://aclanthology.org/D19-5317.pdf.
  14. Suchenwirth, L.: Sacherschliessung in Zeiten von Corona : neue Herausforderungen und Chancen (2019) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 484) [ClassicSimilarity], result of:
          0.23174728 = score(doc=484,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 484, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=484)
      0.2 = coord(1/5)
    
    Footnote
    https://journals.univie.ac.at/index.php/voebm/article/download/5332/5271/
  15. Noever, D.; Ciolino, M.: ¬The Turing deception (2022) 0.05
    0.04634946 = product of:
      0.23174728 = sum of:
        0.23174728 = weight(_text_:3a in 862) [ClassicSimilarity], result of:
          0.23174728 = score(doc=862,freq=2.0), product of:
            0.41234848 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.04863741 = queryNorm
            0.56201804 = fieldWeight in 862, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=862)
      0.2 = coord(1/5)
    
    Source
    https://arxiv.org/abs/2212.06721
  16. Pearce, C.; Nicholas, C.: TELLTALE: Experiments in a dynamic hypertext environment for degraded and multilingual data (1996) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 4071) [ClassicSimilarity], result of:
          0.2094216 = score(doc=4071,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 4071, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=4071)
      0.2 = coord(1/5)
    
    Abstract
    Methods and tools for finding documents relevant to a user's needs in a document corpus can be found in the information retrieval, library science, and hypertext communities. Typically, these systems provide retrieval capabilities for fairly static corpora, their algorithms are dependent on the language for which they are written, e.g. English, and they do not perform well when presented with misspelled words or text that has been degraded by OCR techniques. In this article, we present experimentation results for the TELLTALE system. TELLTALE is a dynamic hypertext environment that provides full-text search from a hypertext-style user interface for text corpora that may be garbled by OCR or transmission errors, and that may contain languages other than English. TELLTALE uses several techniques based on n-grams (n-character sequences of text). With these results we show that the dynamic linkage mechanisms in TELLTALE are tolerant of garbles in up to 30% of the characters in the body of the texts.
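    A minimal Python sketch, for illustration only, of garble-tolerant matching with character n-gram profiles in the spirit of the techniques described; the cosine measure and n = 5 are assumptions, not TELLTALE's actual configuration.

import math
from collections import Counter

def ngram_profile(text: str, n: int = 5) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

clean = "dynamic hypertext environment for degraded data"
garbled = "dynamlc hypertekt environm3nt for degraded data"   # a few characters garbled
print(cosine(ngram_profile(clean), ngram_profile(garbled)))   # remains well above zero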
  17. Haas, S.W.; Grams, E.S.: Readers, authors, and page structure : a discussion of four questions arising from a content analysis of Web pages (2000) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 4387) [ClassicSimilarity], result of:
          0.2094216 = score(doc=4387,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 4387, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=4387)
      0.2 = coord(1/5)
    
  18. Westerman, S.J.; Cribbin, T.; Collins, J.: Human assessments of document similarity (2010) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 3915) [ClassicSimilarity], result of:
          0.2094216 = score(doc=3915,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 3915, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=3915)
      0.2 = coord(1/5)
    
    Object
    n-grams
  19. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: ¬The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 3015) [ClassicSimilarity], result of:
          0.2094216 = score(doc=3015,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 3015, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=3015)
      0.2 = coord(1/5)
    
    Abstract
    We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
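    A minimal sketch, assuming scikit-learn is available, of the classification step described above: represent texts by n-gram features and take a simple classifier's cross-validated accuracy as a rough signal of distinctive language use. The toy texts, labels, and feature settings are invented and do not reproduce the scitex experiments.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["parse trees and part of speech tagging", "gene sequence alignment scores",
         "syntactic parsing of corpora", "protein sequence databases"]
disciplines = ["comp_ling", "bioinf", "comp_ling", "bioinf"]   # hypothetical labels

# Character n-grams (within word boundaries) as a crude register feature.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3), analyzer="char_wb"),
                    LogisticRegression(max_iter=1000))
print(cross_val_score(clf, texts, disciplines, cv=2).mean())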
  20. Lin, Y.-R.; Margolin, D.; Lazer, D.: Uncovering social semantics from textual traces : a theory-driven approach and evidence from public statements of U.S. Members of Congress (2016) 0.04
    0.04188432 = product of:
      0.2094216 = sum of:
        0.2094216 = weight(_text_:grams in 3078) [ClassicSimilarity], result of:
          0.2094216 = score(doc=3078,freq=2.0), product of:
            0.39198354 = queryWeight, product of:
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.04863741 = queryNorm
            0.5342612 = fieldWeight in 3078, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.059301 = idf(docFreq=37, maxDocs=44218)
              0.046875 = fieldNorm(doc=3078)
      0.2 = coord(1/5)
    
    Abstract
    The increasing abundance of digital textual archives provides an opportunity for understanding human social systems. Yet the literature has not adequately considered the disparate social processes by which texts are produced. Drawing on communication theory, we identify three common processes by which documents might be detectably similar in their textual features: authors sharing subject matter, sharing goals, and sharing sources. We hypothesize that these processes produce distinct, detectable relationships between authors in different kinds of textual overlap. We develop a novel n-gram extraction technique to capture such signatures based on n-grams of different lengths. We test the hypothesis on a corpus where the author attributes are observable: the public statements of the members of the U.S. Congress. This article presents the first empirical finding that shows different social relationships are detectable through the structure of overlapping textual features. Our study has important implications for designing text modeling techniques to make sense of social phenomena from aggregate digital traces.
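    For illustration, a small Python sketch of the overlap signature described: extract word n-grams of several lengths from two statements and count how many are shared at each length. The length bands are assumptions; the paper's extraction technique is more involved.

from typing import Dict, Set, Tuple

def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_by_length(a: str, b: str, lengths=(2, 4, 8)) -> Dict[int, int]:
    # Long shared spans hint at shared sources; short ones at shared subject matter.
    return {n: len(word_ngrams(a, n) & word_ngrams(b, n)) for n in lengths}

print(overlap_by_length(
    "we must secure the border and support our troops overseas",
    "my colleague and i will support our troops overseas and at home"))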

Types

  • a 3079
  • m 344
  • el 164
  • s 139
  • b 39
  • x 35
  • i 23
  • r 18
  • ? 8
  • p 4
  • d 3
  • n 3
  • u 2
  • z 2
  • au 1
  • h 1