Document (#41311)

Author
Toepfer, M.
Seifert, C.
Title
Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints
Issue
[Submitted on 7 Jun 2018].
Source
https://arxiv.org/abs/1806.02743
Abstract
Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
Content
This is an authors' manuscript version of a paper accepted for the proceedings of TPDL-2018, Porto, Portugal, Sept 10-13. The final authenticated publication is available online at https://doi.org/will be added as soon as available.
Theme
Automatic indexing
Retrieval studies

Similar documents (author)

  1. Seifert, S.: Johann Samuel Ersch, der Begründer der neueren Bibliographie in Deutschland : seine Entwicklung bis zum 'Allgemeinen Repertorium der Literatur' (1968) 5.99
    5.989656 = sum of:
      5.989656 = weight(author_txt:seifert in 4968) [ClassicSimilarity], result of:
        5.989656 = fieldWeight in 4968, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.583449 = idf(docFreq=7, maxDocs=42740)
          0.625 = fieldNorm(doc=4968)
    
  2. Seifert, S.: Universelle bibliographische Verzeichnisse an der Wende vom 18. zum 19. Jahrhundert : historische Analyse und aktuelle Schlußfolgerungen (1987) 5.99
    5.989656 = sum of:
      5.989656 = weight(author_txt:seifert in 2540) [ClassicSimilarity], result of:
        5.989656 = fieldWeight in 2540, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.583449 = idf(docFreq=7, maxDocs=42740)
          0.625 = fieldNorm(doc=2540)
    
  3. Seifert, W.: Herausforderungen bei der Abbildung von Regionalstudien in der Regensburger Verbundklassifikation (2018) 5.99
    5.989656 = sum of:
      5.989656 = weight(author_txt:seifert in 601) [ClassicSimilarity], result of:
        5.989656 = fieldWeight in 601, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.583449 = idf(docFreq=7, maxDocs=42740)
          0.625 = fieldNorm(doc=601)
    
  4. Graner, B.; Fresenborg, M.; Lühr, A.; Seifert, J.; Sünkler, S.: Schriftgutverwaltung an der Hochschule : Entwicklung eines aufgabenorientierten Aktenplans für die Hochschule für Angewandte Wissenschaften Hamburg (2009) 2.99
    2.994828 = sum of:
      2.994828 = weight(author_txt:seifert in 135) [ClassicSimilarity], result of:
        2.994828 = fieldWeight in 135, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.583449 = idf(docFreq=7, maxDocs=42740)
          0.3125 = fieldNorm(doc=135)
    
  5. Böhm, A.; Seifert, C.; Schlötterer, J.; Granitzer, M.: Identifying tweets from the economic domain (2017) 2.99
    2.994828 = sum of:
      2.994828 = weight(author_txt:seifert in 5496) [ClassicSimilarity], result of:
        2.994828 = fieldWeight in 5496, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.583449 = idf(docFreq=7, maxDocs=42740)
          0.3125 = fieldNorm(doc=5496)
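
The author-match scores above can be reproduced by hand. The sketch below assumes the standard Lucene ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the printed values; the function names are illustrative and not part of any Lucene API.

```python
import math

def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw term frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency as printed in the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Entry 1 (doc=4968): single author, fieldNorm = 0.625
single_author = field_weight(1.0, 7, 42740, 0.625)
# Entry 5 (doc=5496): four authors, fieldNorm = 0.3125
multi_author = field_weight(1.0, 7, 42740, 0.3125)

print(f"{single_author:.4f} {multi_author:.4f}")
```

The only factor that differs between the 5.99 and 2.99 entries is fieldNorm, which is smaller for records with longer author fields; tf and idf are identical for all five hits.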
    

Similar documents (content)

  1. Costers, L.: ¬The electronic library and its organizational management (1994) 0.13
    0.13404778 = sum of:
      0.13404778 = product of:
        0.67023885 = sum of:
          0.16275363 = weight(abstract_txt:layered in 1283) [ClassicSimilarity], result of:
            0.16275363 = score(doc=1283,freq=1.0), product of:
              0.18011238 = queryWeight, product of:
                1.0765955 = boost
                8.261693 = idf(docFreq=29, maxDocs=42740)
                0.020249857 = queryNorm
              0.9036227 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.261693 = idf(docFreq=29, maxDocs=42740)
                0.109375 = fieldNorm(doc=1283)
          0.075986385 = weight(abstract_txt:level in 1283) [ClassicSimilarity], result of:
            0.075986385 = score(doc=1283,freq=2.0), product of:
              0.1083961 = queryWeight, product of:
                1.1811433 = boost
                4.5319915 = idf(docFreq=1249, maxDocs=42740)
                0.020249857 = queryNorm
              0.70100665 = fieldWeight in 1283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5319915 = idf(docFreq=1249, maxDocs=42740)
                0.109375 = fieldNorm(doc=1283)
          0.046480015 = weight(abstract_txt:approach in 1283) [ClassicSimilarity], result of:
            0.046480015 = score(doc=1283,freq=1.0), product of:
              0.112652555 = queryWeight, product of:
                1.474728 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.020249857 = queryNorm
              0.4125962 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.109375 = fieldNorm(doc=1283)
          0.12834887 = weight(abstract_txt:where in 1283) [ClassicSimilarity], result of:
            0.12834887 = score(doc=1283,freq=2.0), product of:
              0.1759874 = queryWeight, product of:
                1.8432412 = boost
                4.7149534 = idf(docFreq=1040, maxDocs=42740)
                0.020249857 = queryNorm
              0.7293072 = fieldWeight in 1283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.7149534 = idf(docFreq=1040, maxDocs=42740)
                0.109375 = fieldNorm(doc=1283)
          0.25666997 = weight(abstract_txt:quality in 1283) [ClassicSimilarity], result of:
            0.25666997 = score(doc=1283,freq=3.0), product of:
              0.2893273 = queryWeight, product of:
                3.0511284 = boost
                4.6828146 = idf(docFreq=1074, maxDocs=42740)
                0.020249857 = queryNorm
              0.8871267 = fieldWeight in 1283, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6828146 = idf(docFreq=1074, maxDocs=42740)
                0.109375 = fieldNorm(doc=1283)
        0.2 = coord(5/25)
    
  2. Buchholz, K.: Criteria for the analysis of scientific quality (1995) 0.13
    0.13170947 = sum of:
      0.13170947 = product of:
        0.5487895 = sum of:
          0.09316332 = weight(abstract_txt:notably in 2519) [ClassicSimilarity], result of:
            0.09316332 = score(doc=2519,freq=1.0), product of:
              0.15539551 = queryWeight, product of:
                7.6739063 = idf(docFreq=53, maxDocs=42740)
                0.020249857 = queryNorm
              0.5995239 = fieldWeight in 2519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6739063 = idf(docFreq=53, maxDocs=42740)
                0.078125 = fieldNorm(doc=2519)
          0.03837892 = weight(abstract_txt:level in 2519) [ClassicSimilarity], result of:
            0.03837892 = score(doc=2519,freq=1.0), product of:
              0.1083961 = queryWeight, product of:
                1.1811433 = boost
                4.5319915 = idf(docFreq=1249, maxDocs=42740)
                0.020249857 = queryNorm
              0.35406184 = fieldWeight in 2519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5319915 = idf(docFreq=1249, maxDocs=42740)
                0.078125 = fieldNorm(doc=2519)
          0.1145693 = weight(abstract_txt:short in 2519) [ClassicSimilarity], result of:
            0.1145693 = score(doc=2519,freq=2.0), product of:
              0.17836952 = queryWeight, product of:
                1.5151516 = boost
                5.8135657 = idf(docFreq=346, maxDocs=42740)
                0.020249857 = queryNorm
              0.6423143 = fieldWeight in 2519, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8135657 = idf(docFreq=346, maxDocs=42740)
                0.078125 = fieldNorm(doc=2519)
          0.091961 = weight(abstract_txt:indicators in 2519) [ClassicSimilarity], result of:
            0.091961 = score(doc=2519,freq=1.0), product of:
              0.19409792 = queryWeight, product of:
                1.5805427 = boost
                6.0644684 = idf(docFreq=269, maxDocs=42740)
                0.020249857 = queryNorm
              0.4737866 = fieldWeight in 2519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0644684 = idf(docFreq=269, maxDocs=42740)
                0.078125 = fieldNorm(doc=2519)
          0.061024025 = weight(abstract_txt:content in 2519) [ClassicSimilarity], result of:
            0.061024025 = score(doc=2519,freq=1.0), product of:
              0.18604973 = queryWeight, product of:
                2.1883929 = boost
                4.1983805 = idf(docFreq=1744, maxDocs=42740)
                0.020249857 = queryNorm
              0.32799846 = fieldWeight in 2519, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1983805 = idf(docFreq=1744, maxDocs=42740)
                0.078125 = fieldNorm(doc=2519)
          0.14969297 = weight(abstract_txt:quality in 2519) [ClassicSimilarity], result of:
            0.14969297 = score(doc=2519,freq=2.0), product of:
              0.2893273 = queryWeight, product of:
                3.0511284 = boost
                4.6828146 = idf(docFreq=1074, maxDocs=42740)
                0.020249857 = queryNorm
              0.5173828 = fieldWeight in 2519, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6828146 = idf(docFreq=1074, maxDocs=42740)
                0.078125 = fieldNorm(doc=2519)
        0.24 = coord(6/25)
    
  3. Kenter, T.; Balog, K.; Rijke, M. de: Evaluating document filtering systems over time (2015) 0.09
    0.091446534 = sum of:
      0.091446534 = product of:
        0.45723265 = sum of:
          0.055457927 = weight(abstract_txt:document in 4673) [ClassicSimilarity], result of:
            0.055457927 = score(doc=4673,freq=6.0), product of:
              0.09671157 = queryWeight, product of:
                1.115668 = boost
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.020249857 = queryNorm
              0.5734363 = fieldWeight in 4673, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.280766 = idf(docFreq=1606, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4673)
          0.06852419 = weight(abstract_txt:precision in 4673) [ClassicSimilarity], result of:
            0.06852419 = score(doc=4673,freq=2.0), product of:
              0.16060993 = queryWeight, product of:
                1.4377453 = boost
                5.5165615 = idf(docFreq=466, maxDocs=42740)
                0.020249857 = queryNorm
              0.42664978 = fieldWeight in 4673, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5165615 = idf(docFreq=466, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4673)
          0.076922394 = weight(abstract_txt:recall in 4673) [ClassicSimilarity], result of:
            0.076922394 = score(doc=4673,freq=2.0), product of:
              0.17347823 = queryWeight, product of:
                1.4942328 = boost
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.020249857 = queryNorm
              0.4434124 = fieldWeight in 4673, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4673)
          0.056708913 = weight(abstract_txt:short in 4673) [ClassicSimilarity], result of:
            0.056708913 = score(doc=4673,freq=1.0), product of:
              0.17836952 = queryWeight, product of:
                1.5151516 = boost
                5.8135657 = idf(docFreq=346, maxDocs=42740)
                0.020249857 = queryNorm
              0.3179294 = fieldWeight in 4673, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8135657 = idf(docFreq=346, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4673)
          0.19961922 = weight(abstract_txt:estimation in 4673) [ClassicSimilarity], result of:
            0.19961922 = score(doc=4673,freq=2.0), product of:
              0.3276006 = queryWeight, product of:
                2.0533743 = boost
                7.8787007 = idf(docFreq=43, maxDocs=42740)
                0.020249857 = queryNorm
              0.60933715 = fieldWeight in 4673, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.8787007 = idf(docFreq=43, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4673)
        0.2 = coord(5/25)
    
  4. Daranyi, S.; Zawiasa, R.; Hajnal, Z.: Conceptual mapping of a database in the humanities : first results of an experiment with Sophia (1996) 0.09
    0.09023411 = sum of:
      0.09023411 = product of:
        0.5639632 = sum of:
          0.14773528 = weight(abstract_txt:configurations in 4566) [ClassicSimilarity], result of:
            0.14773528 = score(doc=4566,freq=1.0), product of:
              0.16885449 = queryWeight, product of:
                1.0424064 = boost
                7.999329 = idf(docFreq=38, maxDocs=42740)
                0.020249857 = queryNorm
              0.8749266 = fieldWeight in 4566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.999329 = idf(docFreq=38, maxDocs=42740)
                0.109375 = fieldNorm(doc=4566)
          0.14872076 = weight(abstract_txt:texts in 4566) [ClassicSimilarity], result of:
            0.14872076 = score(doc=4566,freq=2.0), product of:
              0.16960454 = queryWeight, product of:
                1.4774559 = boost
                5.668929 = idf(docFreq=400, maxDocs=42740)
                0.020249857 = queryNorm
              0.8768678 = fieldWeight in 4566, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.668929 = idf(docFreq=400, maxDocs=42740)
                0.109375 = fieldNorm(doc=4566)
          0.18207347 = weight(abstract_txt:indicators in 4566) [ClassicSimilarity], result of:
            0.18207347 = score(doc=4566,freq=2.0), product of:
              0.19409792 = queryWeight, product of:
                1.5805427 = boost
                6.0644684 = idf(docFreq=269, maxDocs=42740)
                0.020249857 = queryNorm
              0.93804955 = fieldWeight in 4566, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0644684 = idf(docFreq=269, maxDocs=42740)
                0.109375 = fieldNorm(doc=4566)
          0.08543364 = weight(abstract_txt:content in 4566) [ClassicSimilarity], result of:
            0.08543364 = score(doc=4566,freq=1.0), product of:
              0.18604973 = queryWeight, product of:
                2.1883929 = boost
                4.1983805 = idf(docFreq=1744, maxDocs=42740)
                0.020249857 = queryNorm
              0.45919788 = fieldWeight in 4566, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1983805 = idf(docFreq=1744, maxDocs=42740)
                0.109375 = fieldNorm(doc=4566)
        0.16 = coord(4/25)
    
  5. Tang, J.; Liang, B.-Y.; Li, J.-Z.: Toward detecting mapping strategies for ontology interoperability (2005) 0.09
    0.08648011 = sum of:
      0.08648011 = product of:
        0.43240055 = sum of:
          0.13116595 = weight(abstract_txt:unseen in 368) [ClassicSimilarity], result of:
            0.13116595 = score(doc=368,freq=1.0), product of:
              0.22651444 = queryWeight, product of:
                1.2073376 = boost
                9.264996 = idf(docFreq=10, maxDocs=42740)
                0.020249857 = queryNorm
              0.5790622 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.264996 = idf(docFreq=10, maxDocs=42740)
                0.0625 = fieldNorm(doc=368)
          0.05537591 = weight(abstract_txt:precision in 368) [ClassicSimilarity], result of:
            0.05537591 = score(doc=368,freq=1.0), product of:
              0.16060993 = queryWeight, product of:
                1.4377453 = boost
                5.5165615 = idf(docFreq=466, maxDocs=42740)
                0.020249857 = queryNorm
              0.3447851 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5165615 = idf(docFreq=466, maxDocs=42740)
                0.0625 = fieldNorm(doc=368)
          0.026560009 = weight(abstract_txt:approach in 368) [ClassicSimilarity], result of:
            0.026560009 = score(doc=368,freq=1.0), product of:
              0.112652555 = queryWeight, product of:
                1.474728 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.020249857 = queryNorm
              0.23576926 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.0625 = fieldNorm(doc=368)
          0.062162682 = weight(abstract_txt:recall in 368) [ClassicSimilarity], result of:
            0.062162682 = score(doc=368,freq=1.0), product of:
              0.17347823 = queryWeight, product of:
                1.4942328 = boost
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.020249857 = queryNorm
              0.35833132 = fieldWeight in 368, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.733301 = idf(docFreq=375, maxDocs=42740)
                0.0625 = fieldNorm(doc=368)
          0.157136 = weight(abstract_txt:multi in 368) [ClassicSimilarity], result of:
            0.157136 = score(doc=368,freq=5.0), product of:
              0.18825749 = queryWeight, product of:
                1.5565816 = boost
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.020249857 = queryNorm
              0.8346866 = fieldWeight in 368, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.972531 = idf(docFreq=295, maxDocs=42740)
                0.0625 = fieldNorm(doc=368)
        0.2 = coord(5/25)
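
The content-based scores combine several term matches. As a minimal sketch, again assuming the ClassicSimilarity formulas and using illustrative names, the following recomputes the score of the first content hit (doc=1283) from the boost, docFreq, and freq values printed in its explain tree.

```python
import math

MAX_DOCS = 42740
QUERY_NORM = 0.020249857   # queryNorm, copied from the explain output
FIELD_NORM = 0.109375      # fieldNorm(doc=1283)

def idf(doc_freq: int) -> float:
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1))

def term_score(boost: float, doc_freq: int, freq: float) -> float:
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    # fieldWeight = sqrt(freq) * idf * fieldNorm
    field_weight = math.sqrt(freq) * idf(doc_freq) * FIELD_NORM
    return query_weight * field_weight

# (boost, docFreq, freq) for the five matching query terms in doc 1283
matched_terms = {
    "layered":  (1.0765955, 29, 1.0),
    "level":    (1.1811433, 1249, 2.0),
    "approach": (1.474728, 2671, 1.0),
    "where":    (1.8432412, 1040, 2.0),
    "quality":  (3.0511284, 1074, 3.0),
}

raw_sum = sum(term_score(*params) for params in matched_terms.values())
coord = len(matched_terms) / 25   # 5 of the 25 query terms matched
score = raw_sum * coord

print(f"sum={raw_sum:.4f} score={score:.4f}")
```

The coord factor scales the summed term scores by the fraction of query terms found in the document, which is why a hit matching six terms (coord 6/25) can rank close to one with a larger raw sum but fewer matches.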