Document (#40788)

Author
Kauchak, D.
Leroy, G.
Hogue, A.
Title
Measuring text difficulty using parse-tree frequency
Source
Journal of the Association for Information Science and Technology. 68(2017) no.9, pp.2088-2100
Year
2017
Abstract
Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
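The grammar-frequency idea in the abstract can be illustrated with a minimal sketch. It assumes sentences are already parsed into Penn-Treebank-style bracketed strings, and it reads "3rd level" as keeping node labels down to three levels from the root; both are assumptions of this sketch, not details confirmed by the record. The helper names (`parse_brackets`, `truncate`, `structure_counts`) are hypothetical, and the 11-bin step is omitted because the abstract does not say how bin boundaries were chosen.

```python
from collections import Counter


def parse_brackets(text):
    """Parse '(S (NP (DT the)) ...)' into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                pos += 1              # skip the word itself; only labels matter
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def truncate(tree, depth):
    """Render the tree's labels down to `depth` levels, dropping everything below."""
    label, children = tree
    if depth == 1 or not children:
        return label
    inner = " ".join(truncate(child, depth - 1) for child in children)
    return "(" + label + " " + inner + ")"


def structure_counts(bracketed_parses, depth=3):
    """Count how often each truncated structure occurs in a collection of parses."""
    return Counter(truncate(parse_brackets(p), depth) for p in bracketed_parses)


# Toy corpus (assumed data): the first two sentences differ in words only.
parses = [
    "(S (NP (DT the) (NN cat)) (VP (VBD sat)))",
    "(S (NP (DT a) (NN dog)) (VP (VBD ran)))",
    "(S (NP (PRP he)) (VP (VBD ran) (NP (DT a) (NN race))))",
]
counts = structure_counts(parses)
for structure, n in counts.most_common():
    print(n, structure)
```

In this toy corpus the first two sentences share one structure key even though their words differ, which is the sense in which the measure looks past surface presentation.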
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23855/full.
Theme
Computational linguistics

Similar documents (author)

  1. Leroy, G.; Chen, H.: Genescene: an ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts (2005) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:leroy in 260) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 260, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=260)
    
  2. Leroy, S.Y.; Thomas, S.L.: Impact of Web access on cataloging (2004) 4.86
    4.85849 = sum of:
      4.85849 = weight(author_txt:leroy in 657) [ClassicSimilarity], result of:
        4.85849 = fieldWeight in 657, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.5 = fieldNorm(doc=657)
    
  3. Ku, C.-H.; Leroy, G.: A crime reports analysis system to identify related crimes (2011) 4.25
    4.2511787 = sum of:
      4.2511787 = weight(author_txt:leroy in 1630) [ClassicSimilarity], result of:
        4.2511787 = fieldWeight in 1630, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.4375 = fieldNorm(doc=1630)
    
  4. Leroy, G.; Miller, T.; Rosemblat, G.; Browne, A.: A balanced approach to health information evaluation : a vocabulary-based naïve Bayes classifier and readability formulas (2008) 3.04
    3.0365562 = sum of:
      3.0365562 = weight(author_txt:leroy in 3999) [ClassicSimilarity], result of:
        3.0365562 = fieldWeight in 3999, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.3125 = fieldNorm(doc=3999)
    
  5. Thirion, B.; Leroy, J.P.; Baudic, F.; Douyère, M.; Piot, J.; Darmoni, S.J.: SDI selecting, describing, and indexing : did you mean automatically? (2001) 2.43
    2.429245 = sum of:
      2.429245 = weight(author_txt:leroy in 199) [ClassicSimilarity], result of:
        2.429245 = fieldWeight in 199, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.71698 = idf(docFreq=6, maxDocs=42740)
          0.25 = fieldNorm(doc=199)
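
The author-match scores above follow the pattern fieldWeight = tf × idf × fieldNorm. As a rough check, the sketch below recomputes them using tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). That idf form is an assumption chosen because it reproduces the 9.71698 shown in the listing. The function names are illustrative, and the fieldNorm values are taken from the listing rather than derived.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency in the form that matches the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def classic_tf(freq):
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the listing: author_txt:leroy, docFreq=6, maxDocs=42740.
for field_norm in (0.5, 0.4375, 0.3125, 0.25):
    print(field_norm, round(field_weight(1.0, 6, 42740, field_norm), 5))
```

Because tf and idf are identical for all five hits, the ranking among them is driven entirely by fieldNorm, which is smaller for records with longer author fields.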
    

Similar documents (content)

  1. Fang, L.; Tuan, L.A.; Hui, S.C.; Wu, L.: Syntactic based approach for grammar question retrieval (2018) 0.23
    0.22968093 = sum of:
      0.22968093 = product of:
        1.1484046 = sum of:
          0.016082799 = weight(abstract_txt:with in 1087) [ClassicSimilarity], result of:
            0.016082799 = score(doc=1087,freq=4.0), product of:
              0.051104728 = queryWeight, product of:
                1.3135262 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.015453675 = queryNorm
              0.31470278 = fieldWeight in 1087, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=1087)
          0.11185251 = weight(abstract_txt:tree in 1087) [ClassicSimilarity], result of:
            0.11185251 = score(doc=1087,freq=3.0), product of:
              0.15100223 = queryWeight, product of:
                1.4280055 = boost
                6.842609 = idf(docFreq=123, maxDocs=42740)
                0.015453675 = queryNorm
              0.74073416 = fieldWeight in 1087, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.842609 = idf(docFreq=123, maxDocs=42740)
                0.0625 = fieldNorm(doc=1087)
          0.098266356 = weight(abstract_txt:sentence in 1087) [ClassicSimilarity], result of:
            0.098266356 = score(doc=1087,freq=1.0), product of:
              0.22867934 = queryWeight, product of:
                2.1522727 = boost
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.015453675 = queryNorm
              0.4297124 = fieldWeight in 1087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.0625 = fieldNorm(doc=1087)
          0.39436758 = weight(abstract_txt:parse in 1087) [ClassicSimilarity], result of:
            0.39436758 = score(doc=1087,freq=3.0), product of:
              0.40042153 = queryWeight, product of:
                2.8480167 = boost
                9.097941 = idf(docFreq=12, maxDocs=42740)
                0.015453675 = queryNorm
              0.98488104 = fieldWeight in 1087, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.097941 = idf(docFreq=12, maxDocs=42740)
                0.0625 = fieldNorm(doc=1087)
          0.5278354 = weight(abstract_txt:grammar in 1087) [ClassicSimilarity], result of:
            0.5278354 = score(doc=1087,freq=9.0), product of:
              0.3711261 = queryWeight, product of:
                3.166022 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.015453675 = queryNorm
              1.4222536 = fieldWeight in 1087, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.0625 = fieldNorm(doc=1087)
        0.2 = coord(5/25)
    
  2. Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.16
    0.16148534 = sum of:
      0.16148534 = product of:
        0.57673335 = sum of:
          0.02269513 = weight(abstract_txt:term in 3558) [ClassicSimilarity], result of:
            0.02269513 = score(doc=3558,freq=1.0), product of:
              0.07519952 = queryWeight, product of:
                1.0077336 = boost
                4.8287816 = idf(docFreq=928, maxDocs=42740)
                0.015453675 = queryNorm
              0.30179885 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8287816 = idf(docFreq=928, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
          0.018012505 = weight(abstract_txt:using in 3558) [ClassicSimilarity], result of:
            0.018012505 = score(doc=3558,freq=2.0), product of:
              0.05856837 = queryWeight, product of:
                1.0892195 = boost
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.015453675 = queryNorm
              0.30754665 = fieldWeight in 3558, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
          0.032384932 = weight(abstract_txt:measure in 3558) [ClassicSimilarity], result of:
            0.032384932 = score(doc=3558,freq=1.0), product of:
              0.09531369 = queryWeight, product of:
                1.1345296 = boost
                5.4363537 = idf(docFreq=505, maxDocs=42740)
                0.015453675 = queryNorm
              0.3397721 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4363537 = idf(docFreq=505, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
          0.0080413995 = weight(abstract_txt:with in 3558) [ClassicSimilarity], result of:
            0.0080413995 = score(doc=3558,freq=1.0), product of:
              0.051104728 = queryWeight, product of:
                1.3135262 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.015453675 = queryNorm
              0.15735139 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
          0.098266356 = weight(abstract_txt:sentence in 3558) [ClassicSimilarity], result of:
            0.098266356 = score(doc=3558,freq=1.0), product of:
              0.22867934 = queryWeight, product of:
                2.1522727 = boost
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.015453675 = queryNorm
              0.4297124 = fieldWeight in 3558, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
          0.12084774 = weight(abstract_txt:frequency in 3558) [ClassicSimilarity], result of:
            0.12084774 = score(doc=3558,freq=2.0), product of:
              0.22930788 = queryWeight, product of:
                2.4886434 = boost
                5.962447 = idf(docFreq=298, maxDocs=42740)
                0.015453675 = queryNorm
              0.52701086 = fieldWeight in 3558, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.962447 = idf(docFreq=298, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
          0.27648526 = weight(abstract_txt:sentences in 3558) [ClassicSimilarity], result of:
            0.27648526 = score(doc=3558,freq=4.0), product of:
              0.3160079 = queryWeight, product of:
                2.921475 = boost
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.015453675 = queryNorm
              0.87493145 = fieldWeight in 3558, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.0625 = fieldNorm(doc=3558)
        0.28 = coord(7/25)
    
  3. Doko, A.; Stula, M.; Seric, L.: Improved sentence retrieval using local context and sentence length (2013) 0.14
    0.14425257 = sum of:
      0.14425257 = product of:
        0.7212628 = sum of:
          0.02836891 = weight(abstract_txt:term in 4706) [ClassicSimilarity], result of:
            0.02836891 = score(doc=4706,freq=1.0), product of:
              0.07519952 = queryWeight, product of:
                1.0077336 = boost
                4.8287816 = idf(docFreq=928, maxDocs=42740)
                0.015453675 = queryNorm
              0.37724856 = fieldWeight in 4706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8287816 = idf(docFreq=928, maxDocs=42740)
                0.078125 = fieldNorm(doc=4706)
          0.022515632 = weight(abstract_txt:using in 4706) [ClassicSimilarity], result of:
            0.022515632 = score(doc=4706,freq=2.0), product of:
              0.05856837 = queryWeight, product of:
                1.0892195 = boost
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.015453675 = queryNorm
              0.3844333 = fieldWeight in 4706, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.078125 = fieldNorm(doc=4706)
          0.17371202 = weight(abstract_txt:sentence in 4706) [ClassicSimilarity], result of:
            0.17371202 = score(doc=4706,freq=2.0), product of:
              0.22867934 = queryWeight, product of:
                2.1522727 = boost
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.015453675 = queryNorm
              0.7596314 = fieldWeight in 4706, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.078125 = fieldNorm(doc=4706)
          0.15105967 = weight(abstract_txt:frequency in 4706) [ClassicSimilarity], result of:
            0.15105967 = score(doc=4706,freq=2.0), product of:
              0.22930788 = queryWeight, product of:
                2.4886434 = boost
                5.962447 = idf(docFreq=298, maxDocs=42740)
                0.015453675 = queryNorm
              0.6587636 = fieldWeight in 4706, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.962447 = idf(docFreq=298, maxDocs=42740)
                0.078125 = fieldNorm(doc=4706)
          0.34560657 = weight(abstract_txt:sentences in 4706) [ClassicSimilarity], result of:
            0.34560657 = score(doc=4706,freq=4.0), product of:
              0.3160079 = queryWeight, product of:
                2.921475 = boost
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.015453675 = queryNorm
              1.0936643 = fieldWeight in 4706, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.078125 = fieldNorm(doc=4706)
        0.2 = coord(5/25)
    
  4. Mutawa, F.; Alnajem, S.; Alzhouri, F.: An HPSG approach to Arabic nominal sentences (2008) 0.13
    0.13366221 = sum of:
      0.13366221 = product of:
        1.1138518 = sum of:
          0.02547353 = weight(abstract_txt:using in 3369) [ClassicSimilarity], result of:
            0.02547353 = score(doc=3369,freq=1.0), product of:
              0.05856837 = queryWeight, product of:
                1.0892195 = boost
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.015453675 = queryNorm
              0.43493664 = fieldWeight in 3369, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.125 = fieldNorm(doc=3369)
          0.47888651 = weight(abstract_txt:sentences in 3369) [ClassicSimilarity], result of:
            0.47888651 = score(doc=3369,freq=3.0), product of:
              0.3160079 = queryWeight, product of:
                2.921475 = boost
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.015453675 = queryNorm
              1.5154257 = fieldWeight in 3369, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.125 = fieldNorm(doc=3369)
          0.6094918 = weight(abstract_txt:grammar in 3369) [ClassicSimilarity], result of:
            0.6094918 = score(doc=3369,freq=3.0), product of:
              0.3711261 = queryWeight, product of:
                3.166022 = boost
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.015453675 = queryNorm
              1.642277 = fieldWeight in 3369, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.585353 = idf(docFreq=58, maxDocs=42740)
                0.125 = fieldNorm(doc=3369)
        0.12 = coord(3/25)
    
  5. Goh, A.; Hui, S.C.; Chan, S.K.: A text extraction system for news reports (1996) 0.13
    0.1278618 = sum of:
      0.1278618 = product of:
        0.5327575 = sum of:
          0.018012505 = weight(abstract_txt:using in 6670) [ClassicSimilarity], result of:
            0.018012505 = score(doc=6670,freq=2.0), product of:
              0.05856837 = queryWeight, product of:
                1.0892195 = boost
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.015453675 = queryNorm
              0.30754665 = fieldWeight in 6670, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4794931 = idf(docFreq=3580, maxDocs=42740)
                0.0625 = fieldNorm(doc=6670)
          0.011372257 = weight(abstract_txt:with in 6670) [ClassicSimilarity], result of:
            0.011372257 = score(doc=6670,freq=2.0), product of:
              0.051104728 = queryWeight, product of:
                1.3135262 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.015453675 = queryNorm
              0.22252847 = fieldWeight in 6670, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=6670)
          0.052213583 = weight(abstract_txt:measured in 6670) [ClassicSimilarity], result of:
            0.052213583 = score(doc=6670,freq=1.0), product of:
              0.1310536 = queryWeight, product of:
                1.33034 = boost
                6.3746233 = idf(docFreq=197, maxDocs=42740)
                0.015453675 = queryNorm
              0.39841396 = fieldWeight in 6670, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3746233 = idf(docFreq=197, maxDocs=42740)
                0.0625 = fieldNorm(doc=6670)
          0.17020231 = weight(abstract_txt:sentence in 6670) [ClassicSimilarity], result of:
            0.17020231 = score(doc=6670,freq=3.0), product of:
              0.22867934 = queryWeight, product of:
                2.1522727 = boost
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.015453675 = queryNorm
              0.74428374 = fieldWeight in 6670, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.8753986 = idf(docFreq=119, maxDocs=42740)
                0.0625 = fieldNorm(doc=6670)
          0.08545226 = weight(abstract_txt:frequency in 6670) [ClassicSimilarity], result of:
            0.08545226 = score(doc=6670,freq=1.0), product of:
              0.22930788 = queryWeight, product of:
                2.4886434 = boost
                5.962447 = idf(docFreq=298, maxDocs=42740)
                0.015453675 = queryNorm
              0.37265295 = fieldWeight in 6670, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.962447 = idf(docFreq=298, maxDocs=42740)
                0.0625 = fieldNorm(doc=6670)
          0.19550459 = weight(abstract_txt:sentences in 6670) [ClassicSimilarity], result of:
            0.19550459 = score(doc=6670,freq=2.0), product of:
              0.3160079 = queryWeight, product of:
                2.921475 = boost
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.015453675 = queryNorm
              0.6186699 = fieldWeight in 6670, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9994516 = idf(docFreq=105, maxDocs=42740)
                0.0625 = fieldNorm(doc=6670)
        0.24 = coord(6/25)
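
The content-similarity scores above combine per-term scores (queryWeight × fieldWeight) and then scale the sum by the coord factor, the fraction of the 25 query terms that matched. The sketch below recomputes two figures from the listing under that reading. The function names are illustrative, and the boost, idf, queryNorm and fieldNorm inputs are copied from the explain output rather than derived.

```python
import math


def term_score(freq, boost, idf, query_norm, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def document_score(term_scores, matched, total_clauses):
    """Sum of term scores scaled by coord = matched / total_clauses."""
    return sum(term_scores) * (matched / total_clauses)


# Term "parse" in doc 1087 (first content hit), values copied from the listing.
parse_score = term_score(
    freq=3.0, boost=2.8480167, idf=9.097941,
    query_norm=0.015453675, field_norm=0.0625,
)
print(round(parse_score, 6))

# Fourth content hit (doc 3369): three matching terms out of 25 query terms.
doc_3369 = document_score([0.02547353, 0.47888651, 0.6094918], 3, 25)
print(round(doc_3369, 6))
```

The coord factor explains why doc 3369 ranks below doc 1087 despite a similar raw sum: it matches only 3 of the 25 query terms, against 5 for doc 1087.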