Document (#41311)

Author
Toepfer, M.
Seifert, C.
Title
Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints
Issue
[Submitted on 7 Jun 2018].
Source
https://arxiv.org/abs/1806.02743
Abstract
Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
Content
This is an authors' manuscript version of a paper accepted for the proceedings of TPDL-2018, Porto, Portugal, Sept 10-13. The final authenticated publication is available online at https://doi.org/will be added as soon as available.
Theme
Automatic indexing
Retrieval studies

Similar documents (author)

  1. Seifert, S.: Johann Samuel Ersch, der Begründer der neueren Bibliographie in Deutschland : seine Entwicklung bis zum 'Allgemeinen Repertorium der Literatur' (1968) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:seifert in 4968) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 4968, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=4968)
    
  2. Seifert, S.: Universelle bibliographische Verzeichnisse an der Wende vom 18. zum 19. Jahrhundert : historische Analyse und aktuelle Schlußfolgerungen (1987) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:seifert in 3540) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 3540, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=3540)
    
  3. Seifert, W.: Herausforderungen bei der Abbildung von Regionalstudien in der Regensburger Verbundklassifikation (2018) 6.00
    5.9971275 = sum of:
      5.9971275 = weight(author_txt:seifert in 601) [ClassicSimilarity], result of:
        5.9971275 = fieldWeight in 601, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.625 = fieldNorm(doc=601)
    
  4. Graner, B.; Fresenborg, M.; Lühr, A.; Seifert, J.; Sünkler, S.: Schriftgutverwaltung an der Hochschule : Entwicklung eines aufgabenorientierten Aktenplans für die Hochschule für Angewandte Wissenschaften Hamburg (2009) 3.00
    2.9985638 = sum of:
      2.9985638 = weight(author_txt:seifert in 135) [ClassicSimilarity], result of:
        2.9985638 = fieldWeight in 135, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.3125 = fieldNorm(doc=135)
    
  5. Böhm, A.; Seifert, C.; Schlötterer, J.; Granitzer, M.: Identifying tweets from the economic domain (2017) 3.00
    2.9985638 = sum of:
      2.9985638 = weight(author_txt:seifert in 4960) [ClassicSimilarity], result of:
        2.9985638 = fieldWeight in 4960, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.595404 = idf(docFreq=7, maxDocs=43254)
          0.3125 = fieldNorm(doc=4960)
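
The author-match scores above can be reproduced by hand. The following is a minimal sketch, assuming the ClassicSimilarity conventions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); these formulas are inferred from the figures listed here and are not stated by the record itself. The function names are illustrative only.

```python
import math

MAX_DOCS = 43254  # maxDocs, as reported in the explanations above
DOC_FREQ = 7      # docFreq of the author term "seifert"


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed form): 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term frequency factor (assumed form): square root of the raw frequency."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, mirroring the 'product of' lines above."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# fieldNorm values are copied from the explanations, not computed here.
short_author_field = field_weight(1.0, DOC_FREQ, MAX_DOCS, 0.625)
long_author_field = field_weight(1.0, DOC_FREQ, MAX_DOCS, 0.3125)
print(f"{short_author_field:.4f} {long_author_field:.4f}")
```

Under these assumptions the two results agree with the listed 5.9971275 and 2.9985638 up to single-precision rounding. The only factor that differs between the 6.00 and 3.00 entries is fieldNorm, which is halved for the records with several authors, presumably because their author field is longer.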
    

Similar documents (content)

  1. Costers, L.: ¬The electronic library and its organizational management (1994) 0.13
    0.13333736 = sum of:
      0.13333736 = product of:
        0.6666868 = sum of:
          0.16189688 = weight(abstract_txt:layered in 2283) [ClassicSimilarity], result of:
            0.16189688 = score(doc=2283,freq=1.0), product of:
              0.17961724 = queryWeight, product of:
                1.0773077 = boost
                8.240858 = idf(docFreq=30, maxDocs=43254)
                0.02023186 = queryNorm
              0.9013438 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.240858 = idf(docFreq=30, maxDocs=43254)
                0.109375 = fieldNorm(doc=2283)
          0.0756089 = weight(abstract_txt:level in 2283) [ClassicSimilarity], result of:
            0.0756089 = score(doc=2283,freq=2.0), product of:
              0.10811957 = queryWeight, product of:
                1.182042 = boost
                4.5210114 = idf(docFreq=1278, maxDocs=43254)
                0.02023186 = queryNorm
              0.6993082 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5210114 = idf(docFreq=1278, maxDocs=43254)
                0.109375 = fieldNorm(doc=2283)
          0.0459799 = weight(abstract_txt:approach in 2283) [ClassicSimilarity], result of:
            0.0459799 = score(doc=2283,freq=1.0), product of:
              0.11192869 = queryWeight, product of:
                1.4729809 = boost
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.02023186 = queryNorm
              0.41079637 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.109375 = fieldNorm(doc=2283)
          0.1276061 = weight(abstract_txt:where in 2283) [ClassicSimilarity], result of:
            0.1276061 = score(doc=2283,freq=2.0), product of:
              0.17544205 = queryWeight, product of:
                1.8441373 = boost
                4.7022386 = idf(docFreq=1066, maxDocs=43254)
                0.02023186 = queryNorm
              0.7273404 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.7022386 = idf(docFreq=1066, maxDocs=43254)
                0.109375 = fieldNorm(doc=2283)
          0.25559503 = weight(abstract_txt:quality in 2283) [ClassicSimilarity], result of:
            0.25559503 = score(doc=2283,freq=3.0), product of:
              0.28873995 = queryWeight, product of:
                3.0542474 = boost
                4.672689 = idf(docFreq=1098, maxDocs=43254)
                0.02023186 = queryNorm
              0.8852084 = fieldWeight in 2283, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.672689 = idf(docFreq=1098, maxDocs=43254)
                0.109375 = fieldNorm(doc=2283)
        0.2 = coord(5/25)
    
  2. Buchholz, K.: Criteria for the analysis of scientific quality (1995) 0.13
    0.13122907 = sum of:
      0.13122907 = product of:
        0.5467878 = sum of:
          0.092489235 = weight(abstract_txt:notably in 3519) [ClassicSimilarity], result of:
            0.092489235 = score(doc=3519,freq=1.0), product of:
              0.15476348 = queryWeight, product of:
                7.649493 = idf(docFreq=55, maxDocs=43254)
                0.02023186 = queryNorm
              0.5976167 = fieldWeight in 3519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.649493 = idf(docFreq=55, maxDocs=43254)
                0.078125 = fieldNorm(doc=3519)
          0.038188267 = weight(abstract_txt:level in 3519) [ClassicSimilarity], result of:
            0.038188267 = score(doc=3519,freq=1.0), product of:
              0.10811957 = queryWeight, product of:
                1.182042 = boost
                4.5210114 = idf(docFreq=1278, maxDocs=43254)
                0.02023186 = queryNorm
              0.353204 = fieldWeight in 3519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5210114 = idf(docFreq=1278, maxDocs=43254)
                0.078125 = fieldNorm(doc=3519)
          0.11452559 = weight(abstract_txt:short in 3519) [ClassicSimilarity], result of:
            0.11452559 = score(doc=3519,freq=2.0), product of:
              0.17846075 = queryWeight, product of:
                1.5186305 = boost
                5.808377 = idf(docFreq=352, maxDocs=43254)
                0.02023186 = queryNorm
              0.64174104 = fieldWeight in 3519, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.808377 = idf(docFreq=352, maxDocs=43254)
                0.078125 = fieldNorm(doc=3519)
          0.09155193 = weight(abstract_txt:indicators in 3519) [ClassicSimilarity], result of:
            0.09155193 = score(doc=3519,freq=1.0), product of:
              0.19367015 = queryWeight, product of:
                1.5820205 = boost
                6.0508275 = idf(docFreq=276, maxDocs=43254)
                0.02023186 = queryNorm
              0.4727209 = fieldWeight in 3519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0508275 = idf(docFreq=276, maxDocs=43254)
                0.078125 = fieldNorm(doc=3519)
          0.06096671 = weight(abstract_txt:content in 3519) [ClassicSimilarity], result of:
            0.06096671 = score(doc=3519,freq=1.0), product of:
              0.18607564 = queryWeight, product of:
                2.1930096 = boost
                4.193853 = idf(docFreq=1773, maxDocs=43254)
                0.02023186 = queryNorm
              0.32764477 = fieldWeight in 3519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.193853 = idf(docFreq=1773, maxDocs=43254)
                0.078125 = fieldNorm(doc=3519)
          0.14906606 = weight(abstract_txt:quality in 3519) [ClassicSimilarity], result of:
            0.14906606 = score(doc=3519,freq=2.0), product of:
              0.28873995 = queryWeight, product of:
                3.0542474 = boost
                4.672689 = idf(docFreq=1098, maxDocs=43254)
                0.02023186 = queryNorm
              0.5162641 = fieldWeight in 3519, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.672689 = idf(docFreq=1098, maxDocs=43254)
                0.078125 = fieldNorm(doc=3519)
        0.24 = coord(6/25)
    
  3. Kenter, T.; Balog, K.; Rijke, M. de: Evaluating document filtering systems over time (2015) 0.09
    0.09183406 = sum of:
      0.09183406 = product of:
        0.45917028 = sum of:
          0.05571337 = weight(abstract_txt:document in 4137) [ClassicSimilarity], result of:
            0.05571337 = score(doc=4137,freq=6.0), product of:
              0.09708263 = queryWeight, product of:
                1.1200864 = boost
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.02023186 = queryNorm
              0.5738758 = fieldWeight in 4137, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2840466 = idf(docFreq=1620, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4137)
          0.068730526 = weight(abstract_txt:precision in 4137) [ClassicSimilarity], result of:
            0.068730526 = score(doc=4137,freq=2.0), product of:
              0.16105546 = queryWeight, product of:
                1.442675 = boost
                5.517866 = idf(docFreq=471, maxDocs=43254)
                0.02023186 = queryNorm
              0.4267507 = fieldWeight in 4137, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.517866 = idf(docFreq=471, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4137)
          0.07704866 = weight(abstract_txt:recall in 4137) [ClassicSimilarity], result of:
            0.07704866 = score(doc=4137,freq=2.0), product of:
              0.17380105 = queryWeight, product of:
                1.4986733 = boost
                5.7320457 = idf(docFreq=380, maxDocs=43254)
                0.02023186 = queryNorm
              0.44331527 = fieldWeight in 4137, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7320457 = idf(docFreq=380, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4137)
          0.056687273 = weight(abstract_txt:short in 4137) [ClassicSimilarity], result of:
            0.056687273 = score(doc=4137,freq=1.0), product of:
              0.17846075 = queryWeight, product of:
                1.5186305 = boost
                5.808377 = idf(docFreq=352, maxDocs=43254)
                0.02023186 = queryNorm
              0.3176456 = fieldWeight in 4137, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.808377 = idf(docFreq=352, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4137)
          0.20099045 = weight(abstract_txt:estimation in 4137) [ClassicSimilarity], result of:
            0.20099045 = score(doc=4137,freq=2.0), product of:
              0.32935122 = queryWeight, product of:
                2.0630531 = boost
                7.8906555 = idf(docFreq=43, maxDocs=43254)
                0.02023186 = queryNorm
              0.61026174 = fieldWeight in 4137, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.8906555 = idf(docFreq=43, maxDocs=43254)
                0.0546875 = fieldNorm(doc=4137)
        0.2 = coord(5/25)
    
  4. Daranyi, S.; Zawiasa, R.; Hajnal, Z.: Conceptual mapping of a database in the humanities : first results of an experiment with Sophia (1996) 0.09
    0.09017812 = sum of:
      0.09017812 = product of:
        0.56361324 = sum of:
          0.14873989 = weight(abstract_txt:configurations in 5566) [ClassicSimilarity], result of:
            0.14873989 = score(doc=5566,freq=1.0), product of:
              0.16974904 = queryWeight, product of:
                1.0472959 = boost
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.02023186 = queryNorm
              0.87623405 = fieldWeight in 5566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.011283 = idf(docFreq=38, maxDocs=43254)
                0.109375 = fieldNorm(doc=5566)
          0.1482564 = weight(abstract_txt:texts in 5566) [ClassicSimilarity], result of:
            0.1482564 = score(doc=5566,freq=2.0), product of:
              0.169381 = queryWeight, product of:
                1.4794936 = boost
                5.658688 = idf(docFreq=409, maxDocs=43254)
                0.02023186 = queryNorm
              0.8752836 = fieldWeight in 5566, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.658688 = idf(docFreq=409, maxDocs=43254)
                0.109375 = fieldNorm(doc=5566)
          0.18126357 = weight(abstract_txt:indicators in 5566) [ClassicSimilarity], result of:
            0.18126357 = score(doc=5566,freq=2.0), product of:
              0.19367015 = queryWeight, product of:
                1.5820205 = boost
                6.0508275 = idf(docFreq=276, maxDocs=43254)
                0.02023186 = queryNorm
              0.9359396 = fieldWeight in 5566, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0508275 = idf(docFreq=276, maxDocs=43254)
                0.109375 = fieldNorm(doc=5566)
          0.08535339 = weight(abstract_txt:content in 5566) [ClassicSimilarity], result of:
            0.08535339 = score(doc=5566,freq=1.0), product of:
              0.18607564 = queryWeight, product of:
                2.1930096 = boost
                4.193853 = idf(docFreq=1773, maxDocs=43254)
                0.02023186 = queryNorm
              0.45870265 = fieldWeight in 5566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.193853 = idf(docFreq=1773, maxDocs=43254)
                0.109375 = fieldNorm(doc=5566)
        0.16 = coord(4/25)
    
  5. Tang, J.; Liang, B.-Y.; Li, J.-Z.: Toward detecting mapping strategies for ontology interoperability (2005) 0.09
    0.08600507 = sum of:
      0.08600507 = product of:
        0.43002534 = sum of:
          0.12829831 = weight(abstract_txt:unseen in 368) [ClassicSimilarity], result of:
            0.12829831 = score(doc=368,freq=1.0), product of:
              0.22337179 = queryWeight, product of:
                1.2013787 = boost
                9.189939 = idf(docFreq=11, maxDocs=43254)
                0.02023186 = queryNorm
              0.57437116 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.189939 = idf(docFreq=11, maxDocs=43254)
                0.0625 = fieldNorm(doc=368)
          0.055542655 = weight(abstract_txt:precision in 368) [ClassicSimilarity], result of:
            0.055542655 = score(doc=368,freq=1.0), product of:
              0.16105546 = queryWeight, product of:
                1.442675 = boost
                5.517866 = idf(docFreq=471, maxDocs=43254)
                0.02023186 = queryNorm
              0.34486663 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.517866 = idf(docFreq=471, maxDocs=43254)
                0.0625 = fieldNorm(doc=368)
          0.026274227 = weight(abstract_txt:approach in 368) [ClassicSimilarity], result of:
            0.026274227 = score(doc=368,freq=1.0), product of:
              0.11192869 = queryWeight, product of:
                1.4729809 = boost
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.02023186 = queryNorm
              0.23474078 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7558525 = idf(docFreq=2748, maxDocs=43254)
                0.0625 = fieldNorm(doc=368)
          0.06226472 = weight(abstract_txt:recall in 368) [ClassicSimilarity], result of:
            0.06226472 = score(doc=368,freq=1.0), product of:
              0.17380105 = queryWeight, product of:
                1.4986733 = boost
                5.7320457 = idf(docFreq=380, maxDocs=43254)
                0.02023186 = queryNorm
              0.35825285 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7320457 = idf(docFreq=380, maxDocs=43254)
                0.0625 = fieldNorm(doc=368)
          0.15764545 = weight(abstract_txt:multi in 368) [ClassicSimilarity], result of:
            0.15764545 = score(doc=368,freq=5.0), product of:
              0.1888087 = queryWeight, product of:
                1.5620385 = boost
                5.9744015 = idf(docFreq=298, maxDocs=43254)
                0.02023186 = queryNorm
              0.834948 = fieldWeight in 368, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9744015 = idf(docFreq=298, maxDocs=43254)
                0.0625 = fieldNorm(doc=368)
        0.2 = coord(5/25)
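
The content-match scores combine more factors than the author matches. As a rough sketch, the first entry (doc 2283, listed score 0.13333736) can be recomputed from the boost, docFreq, queryNorm and fieldNorm values shown in its explanation. The idf and tf formulas below are assumptions inferred from the listed figures, and the variable and function names are illustrative.

```python
import math

MAX_DOCS = 43254         # maxDocs from the explanations
QUERY_NORM = 0.02023186  # queryNorm, identical for every term of the query
FIELD_NORM = 0.109375    # fieldNorm of the abstract field in doc 2283
QUERY_TERMS = 25         # denominator of coord(5/25)

# (term, freq in doc 2283, docFreq, boost), copied from the first explanation
MATCHED_TERMS = [
    ("layered", 1.0, 30, 1.0773077),
    ("level", 2.0, 1278, 1.182042),
    ("approach", 1.0, 2748, 1.4729809),
    ("where", 2.0, 1066, 1.8441373),
    ("quality", 3.0, 1098, 3.0542474),
]


def idf(doc_freq: int, max_docs: int) -> float:
    """Assumed idf form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float) -> float:
    """score = queryWeight * fieldWeight for one matched term."""
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


# Sum the per-term scores, then scale by the fraction of query terms matched.
raw_sum = sum(term_score(freq, df, boost) for _, freq, df, boost in MATCHED_TERMS)
coord = len(MATCHED_TERMS) / QUERY_TERMS
total = raw_sum * coord
print(f"sum={raw_sum:.4f} coord={coord:.2f} total={total:.4f}")
```

Note that idf enters each term score twice, once through queryWeight and once through fieldWeight, so rare terms such as "layered" contribute far more than common ones such as "approach". The coord factor then scales the sum down to the displayed 0.13, since only 5 of the 25 query terms occur in the abstract.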