Document (#40787)

Author
Kauchak, D.
Leroy, G.
Hogue, A.
Title
Measuring text difficulty using parse-tree frequency
Source
Journal of the Association for Information Science and Technology. 68(2017) no.9, pp. 2088-2100
Year
2017
Abstract
Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
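The abstract describes measuring grammar familiarity as the frequency of a sentence's 3rd level parse tree. The following is a minimal sketch of that idea, not the authors' implementation. It assumes Penn-Treebank-style bracketed parses as input, assumes that "3rd level" means the tree cut off three levels below and including the root, and uses a simple equal-count rank binning; the record does not say how the 11 bins were formed. The function names (`parse_bracketed`, `truncate`, `structure_counts`, `rank_bins`) and the example sentences are hypothetical.

```python
from collections import Counter


def parse_bracketed(s):
    """Parse a bracketed parse string into nested (label, children) tuples.

    Words are discarded; only the grammatical structure is kept.
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                pos += 1  # skip the word itself
        pos += 1  # consume ")"
        return (label, children)

    return read()


def truncate(tree, depth):
    """Serialize the tree, keeping only the top `depth` levels of labels."""
    label, children = tree
    if depth <= 1 or not children:
        return label
    inner = " ".join(truncate(child, depth - 1) for child in children)
    return "(" + label + " " + inner + ")"


def structure_counts(parses, depth=3):
    """Count how often each truncated structure occurs in a corpus of parses."""
    return Counter(truncate(parse_bracketed(p), depth) for p in parses)


def rank_bins(counts, n_bins):
    """Assumed binning: rank structures by frequency, split ranks into equal bins.

    Bin 0 holds the most frequent structures.
    """
    ordered = [s for s, _ in counts.most_common()]
    return {s: i * n_bins // len(ordered) for i, s in enumerate(ordered)}


# Hypothetical toy corpus: two sentences share a structure despite different words.
parses = [
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))",
    "(S (NP (DT A) (NN dog)) (VP (VBD barked) (PP (IN at) (NP (NN night)))))",
    "(S (NP (PRP She)) (VP (VBZ runs)))",
]
counts = structure_counts(parses)
bins = rank_bins(counts, n_bins=2)
```

In this toy corpus the first two sentences collapse to the same structure, `(S (NP DT NN) (VP VBD PP))`, which illustrates the abstract's point that sentences with very different surface presentations can share a grammatical structure.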
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23855/full.
Theme
Computational linguistics

Similar documents (author)

  1. Leroy, G.; Chen, H.: Genescene: an ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts (2005) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:leroy in 5259) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 5259, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=5259)
    
  2. Leroy, S.Y.; Thomas, S.L.: Impact of Web access on cataloging (2004) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:leroy in 5656) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 5656, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=5656)
    
  3. Ku, C.-H.; Leroy, G.: A crime reports analysis system to identify related crimes (2011) 4.27
    4.2660527 = sum of:
      4.2660527 = weight(author_txt:leroy in 4629) [ClassicSimilarity], result of:
        4.2660527 = fieldWeight in 4629, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.4375 = fieldNorm(doc=4629)
    
  4. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 3.05
    3.0471804 = sum of:
      3.0471804 = weight(author_txt:leroy in 1998) [ClassicSimilarity], result of:
        3.0471804 = fieldWeight in 1998, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.3125 = fieldNorm(doc=1998)
    
  5. Thirion, B.; Leroy, J.P.; Baudic, F.; Douyère, M.; Piot, J.; Darmoni, S.J.: SDI selecting, describing, and indexing : did you mean automatically? (2001) 2.44
    2.4377444 = sum of:
      2.4377444 = weight(author_txt:leroy in 6198) [ClassicSimilarity], result of:
        2.4377444 = fieldWeight in 6198, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.25 = fieldNorm(doc=6198)
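
The author scores above can be reproduced by hand. The sketch below assumes the usual Lucene ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes the fieldNorm values as given in the trees. The function names are hypothetical and this is not Lucene's own code.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency as ClassicSimilarity is assumed to define it."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)."""
    return math.sqrt(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values read from the first author match (doc 5259): freq=1, docFreq=6,
# maxDocs=44218, fieldNorm=0.5.
idf_leroy = classic_idf(6, 44218)
score_5259 = field_weight(1.0, 6, 44218, 0.5)
# Same term, lower fieldNorm (doc 4629, fieldNorm=0.4375).
score_4629 = field_weight(1.0, 6, 44218, 0.4375)
```

Because tf and idf are identical for all five author matches, the ranking here is driven entirely by fieldNorm, which Lucene typically makes smaller for longer fields (here, presumably longer author lists).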
    

Similar documents (content)

  1. Fang, L.; Tuan, L.A.; Hui, S.C.; Wu, L.: Syntactic based approach for grammar question retrieval (2018) 0.23
    0.22948246 = sum of:
      0.22948246 = product of:
        1.1474123 = sum of:
          0.015640272 = weight(abstract_txt:with in 5086) [ClassicSimilarity], result of:
            0.015640272 = score(doc=5086,freq=4.0), product of:
              0.050054204 = queryWeight, product of:
                1.2995827 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.015407881 = queryNorm
              0.31246668 = fieldWeight in 5086, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=5086)
          0.11123639 = weight(abstract_txt:tree in 5086) [ClassicSimilarity], result of:
            0.11123639 = score(doc=5086,freq=3.0), product of:
              0.15012115 = queryWeight, product of:
                1.4234251 = boost
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.015407881 = queryNorm
              0.74097747 = fieldWeight in 5086, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.0625 = fieldNorm(doc=5086)
          0.09633354 = weight(abstract_txt:sentence in 5086) [ClassicSimilarity], result of:
            0.09633354 = score(doc=5086,freq=1.0), product of:
              0.22518173 = queryWeight, product of:
                2.1351376 = boost
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.015407881 = queryNorm
              0.42780355 = fieldWeight in 5086, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.0625 = fieldNorm(doc=5086)
          0.3962177 = weight(abstract_txt:parse in 5086) [ClassicSimilarity], result of:
            0.3962177 = score(doc=5086,freq=3.0), product of:
              0.40080237 = queryWeight, product of:
                2.848554 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.015407881 = queryNorm
              0.9885613 = fieldWeight in 5086, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=5086)
          0.5279843 = weight(abstract_txt:grammar in 5086) [ClassicSimilarity], result of:
            0.5279843 = score(doc=5086,freq=9.0), product of:
              0.3703914 = queryWeight, product of:
                3.16198 = boost
                7.602543 = idf(docFreq=59, maxDocs=44218)
                0.015407881 = queryNorm
              1.4254768 = fieldWeight in 5086, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.602543 = idf(docFreq=59, maxDocs=44218)
                0.0625 = fieldNorm(doc=5086)
        0.2 = coord(5/25)
    
  2. Mutawa, F.; Alnajem, S.; Alzhouri, F.: An HPSG approach to Arabic nominal sentences (2008) 0.13
    0.13317306 = sum of:
      0.13317306 = product of:
        1.1097755 = sum of:
          0.024952639 = weight(abstract_txt:using in 1368) [ClassicSimilarity], result of:
            0.024952639 = score(doc=1368,freq=1.0), product of:
              0.05764201 = queryWeight, product of:
                1.0802613 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.015407881 = queryNorm
              0.43288982 = fieldWeight in 1368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.125 = fieldNorm(doc=1368)
          0.47515905 = weight(abstract_txt:sentences in 1368) [ClassicSimilarity], result of:
            0.47515905 = score(doc=1368,freq=3.0), product of:
              0.31368467 = queryWeight, product of:
                2.9098816 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.015407881 = queryNorm
              1.5147666 = fieldWeight in 1368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.125 = fieldNorm(doc=1368)
          0.6096638 = weight(abstract_txt:grammar in 1368) [ClassicSimilarity], result of:
            0.6096638 = score(doc=1368,freq=3.0), product of:
              0.3703914 = queryWeight, product of:
                3.16198 = boost
                7.602543 = idf(docFreq=59, maxDocs=44218)
                0.015407881 = queryNorm
              1.6459988 = fieldWeight in 1368, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.602543 = idf(docFreq=59, maxDocs=44218)
                0.125 = fieldNorm(doc=1368)
        0.12 = coord(3/25)
    
  3. Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.13
    0.13139652 = sum of:
      0.13139652 = product of:
        0.5474855 = sum of:
          0.01764418 = weight(abstract_txt:using in 2557) [ClassicSimilarity], result of:
            0.01764418 = score(doc=2557,freq=2.0), product of:
              0.05764201 = queryWeight, product of:
                1.0802613 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.015407881 = queryNorm
              0.30609933 = fieldWeight in 2557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.032191742 = weight(abstract_txt:measure in 2557) [ClassicSimilarity], result of:
            0.032191742 = score(doc=2557,freq=1.0), product of:
              0.09472851 = queryWeight, product of:
                1.1307173 = boost
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.015407881 = queryNorm
              0.33983162 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.007820136 = weight(abstract_txt:with in 2557) [ClassicSimilarity], result of:
            0.007820136 = score(doc=2557,freq=1.0), product of:
              0.050054204 = queryWeight, product of:
                1.2995827 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.015407881 = queryNorm
              0.15623334 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.09633354 = weight(abstract_txt:sentence in 2557) [ClassicSimilarity], result of:
            0.09633354 = score(doc=2557,freq=1.0), product of:
              0.22518173 = queryWeight, product of:
                2.1351376 = boost
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.015407881 = queryNorm
              0.42780355 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.11916267 = weight(abstract_txt:frequency in 2557) [ClassicSimilarity], result of:
            0.11916267 = score(doc=2557,freq=2.0), product of:
              0.22667895 = queryWeight, product of:
                2.4736273 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.015407881 = queryNorm
              0.5256892 = fieldWeight in 2557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.2743332 = weight(abstract_txt:sentences in 2557) [ClassicSimilarity], result of:
            0.2743332 = score(doc=2557,freq=4.0), product of:
              0.31368467 = queryWeight, product of:
                2.9098816 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.015407881 = queryNorm
              0.8745509 = fieldWeight in 2557, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
        0.24 = coord(6/25)
    
  4. Goh, A.; Hui, S.C.; Chan, S.K.: A text extraction system for news reports (1996) 0.13
    0.12601738 = sum of:
      0.12601738 = product of:
        0.5250724 = sum of:
          0.01764418 = weight(abstract_txt:using in 6601) [ClassicSimilarity], result of:
            0.01764418 = score(doc=6601,freq=2.0), product of:
              0.05764201 = queryWeight, product of:
                1.0802613 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.015407881 = queryNorm
              0.30609933 = fieldWeight in 6601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=6601)
          0.011059342 = weight(abstract_txt:with in 6601) [ClassicSimilarity], result of:
            0.011059342 = score(doc=6601,freq=2.0), product of:
              0.050054204 = queryWeight, product of:
                1.2995827 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.015407881 = queryNorm
              0.22094731 = fieldWeight in 6601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=6601)
          0.051270682 = weight(abstract_txt:measured in 6601) [ClassicSimilarity], result of:
            0.051270682 = score(doc=6601,freq=1.0), product of:
              0.12919046 = queryWeight, product of:
                1.320471 = boost
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.015407881 = queryNorm
              0.39686123 = fieldWeight in 6601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.0625 = fieldNorm(doc=6601)
          0.16685459 = weight(abstract_txt:sentence in 6601) [ClassicSimilarity], result of:
            0.16685459 = score(doc=6601,freq=3.0), product of:
              0.22518173 = queryWeight, product of:
                2.1351376 = boost
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.015407881 = queryNorm
              0.74097747 = fieldWeight in 6601, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8448567 = idf(docFreq=127, maxDocs=44218)
                0.0625 = fieldNorm(doc=6601)
          0.08426073 = weight(abstract_txt:frequency in 6601) [ClassicSimilarity], result of:
            0.08426073 = score(doc=6601,freq=1.0), product of:
              0.22667895 = queryWeight, product of:
                2.4736273 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.015407881 = queryNorm
              0.37171838 = fieldWeight in 6601, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.0625 = fieldNorm(doc=6601)
          0.19398287 = weight(abstract_txt:sentences in 6601) [ClassicSimilarity], result of:
            0.19398287 = score(doc=6601,freq=2.0), product of:
              0.31368467 = queryWeight, product of:
                2.9098816 = boost
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.015407881 = queryNorm
              0.6184009 = fieldWeight in 6601, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.996407 = idf(docFreq=109, maxDocs=44218)
                0.0625 = fieldNorm(doc=6601)
        0.24 = coord(6/25)
    
  5. Modjeska, D.; Chignell, M.: Individual differences in exploration using desktop VR (2003) 0.12
    0.119531974 = sum of:
      0.119531974 = product of:
        0.37353742 = sum of:
          0.03640046 = weight(abstract_txt:test in 5161) [ClassicSimilarity], result of:
            0.03640046 = score(doc=5161,freq=2.0), product of:
              0.08160415 = queryWeight, product of:
                1.0494695 = boost
                5.046608 = idf(docFreq=772, maxDocs=44218)
                0.015407881 = queryNorm
              0.44606134 = fieldWeight in 5161, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.046608 = idf(docFreq=772, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.02684117 = weight(abstract_txt:structures in 5161) [ClassicSimilarity], result of:
            0.02684117 = score(doc=5161,freq=1.0), product of:
              0.083917394 = queryWeight, product of:
                1.0642402 = boost
                5.117636 = idf(docFreq=719, maxDocs=44218)
                0.015407881 = queryNorm
              0.31985226 = fieldWeight in 5161, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.117636 = idf(docFreq=719, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.012476319 = weight(abstract_txt:using in 5161) [ClassicSimilarity], result of:
            0.012476319 = score(doc=5161,freq=1.0), product of:
              0.05764201 = queryWeight, product of:
                1.0802613 = boost
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.015407881 = queryNorm
              0.21644491 = fieldWeight in 5161, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.045525998 = weight(abstract_txt:measure in 5161) [ClassicSimilarity], result of:
            0.045525998 = score(doc=5161,freq=2.0), product of:
              0.09472851 = queryWeight, product of:
                1.1307173 = boost
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.015407881 = queryNorm
              0.4805945 = fieldWeight in 5161, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.437306 = idf(docFreq=522, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.019155342 = weight(abstract_txt:with in 5161) [ClassicSimilarity], result of:
            0.019155342 = score(doc=5161,freq=6.0), product of:
              0.050054204 = queryWeight, product of:
                1.2995827 = boost
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.015407881 = queryNorm
              0.38269198 = fieldWeight in 5161, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.4997334 = idf(docFreq=9868, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.051270682 = weight(abstract_txt:measured in 5161) [ClassicSimilarity], result of:
            0.051270682 = score(doc=5161,freq=1.0), product of:
              0.12919046 = queryWeight, product of:
                1.320471 = boost
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.015407881 = queryNorm
              0.39686123 = fieldWeight in 5161, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.068131216 = weight(abstract_txt:perceived in 5161) [ClassicSimilarity], result of:
            0.068131216 = score(doc=5161,freq=1.0), product of:
              0.1787498 = queryWeight, product of:
                1.9023135 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.015407881 = queryNorm
              0.3811541 = fieldWeight in 5161, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
          0.113736235 = weight(abstract_txt:difficulty in 5161) [ClassicSimilarity], result of:
            0.113736235 = score(doc=5161,freq=1.0), product of:
              0.27686003 = queryWeight, product of:
                2.73375 = boost
                6.572923 = idf(docFreq=167, maxDocs=44218)
                0.015407881 = queryNorm
              0.4108077 = fieldWeight in 5161, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.572923 = idf(docFreq=167, maxDocs=44218)
                0.0625 = fieldNorm(doc=5161)
        0.32 = coord(8/25)
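
The content scores combine per-term weights with a coordination factor. As a worked check, the sketch below recomputes the score of the second content match (doc 1368) from the numbers in its tree, assuming each term contributes queryWeight * fieldWeight, where queryWeight = boost * idf * queryNorm and fieldWeight = sqrt(freq) * idf * fieldNorm, and that the sum is then multiplied by coord. The helper name `term_score` is hypothetical.

```python
import math

QUERY_NORM = 0.015407881  # queryNorm as printed in the explain trees


def term_score(boost, idf, freq, field_norm, query_norm=QUERY_NORM):
    """Score contribution of one matching term in one document."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


# (boost, idf, freq) for the three matching terms in doc 1368:
# "using", "sentences", "grammar". fieldNorm is 0.125 for this document.
terms_1368 = [
    (1.0802613, 3.4631186, 1.0),
    (2.9098816, 6.996407, 3.0),
    (3.16198, 7.602543, 3.0),
]
raw_sum = sum(term_score(b, idf, f, 0.125) for b, idf, f in terms_1368)

# coord(3/25): 3 of the 25 query terms matched this document.
score_1368 = raw_sum * 3 / 25
```

The coord factor explains why a document matching more query terms at lower individual weights (for example, item 5 with coord(8/25)) can still rank close to one with fewer but stronger matches.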