Document (#41310)

Author
Toepfer, M.
Seifert, C.
Title
Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints
Issue
[Submitted on 7 Jun 2018].
Source
https://arxiv.org/abs/1806.02743
Abstract
Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
Content
This is an authors' manuscript version of a paper accepted for proceedings of TPDL-2018, Porto, Portugal, Sept 10-13. The final authenticated publication is available online at https://doi.org/will be added as soon as available.
Theme
Automatic indexing
Retrieval studies

Similar documents (author)

  1. Seifert, S.: Johann Samuel Ersch, der Begründer der neueren Bibliographie in Deutschland : seine Entwicklung bis zum 'Allgemeinen Repertorium der Literatur' (1968) 6.01
    6.010904 = sum of:
      6.010904 = weight(author_txt:seifert in 4968) [ClassicSimilarity], result of:
        6.010904 = fieldWeight in 4968, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.625 = fieldNorm(doc=4968)
    
  2. Seifert, S.: Universelle bibliographische Verzeichnisse an der Wende vom 18. zum 19. Jahrhundert : historische Analyse und aktuelle Schlußfolgerungen (1987) 6.01
    6.010904 = sum of:
      6.010904 = weight(author_txt:seifert in 2471) [ClassicSimilarity], result of:
        6.010904 = fieldWeight in 2471, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.625 = fieldNorm(doc=2471)
    
  3. Seifert, W.: Herausforderungen bei der Abbildung von Regionalstudien in der Regensburger Verbundklassifikation (2018) 6.01
    6.010904 = sum of:
      6.010904 = weight(author_txt:seifert in 4600) [ClassicSimilarity], result of:
        6.010904 = fieldWeight in 4600, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.625 = fieldNorm(doc=4600)
    
  4. Graner, B.; Fresenborg, M.; Lühr, A.; Seifert, J.; Sünkler, S.: Schriftgutverwaltung an der Hochschule : Entwicklung eines aufgabenorientierten Aktenplans für die Hochschule für Angewandte Wissenschaften Hamburg (2009) 3.01
    3.005452 = sum of:
      3.005452 = weight(author_txt:seifert in 3134) [ClassicSimilarity], result of:
        3.005452 = fieldWeight in 3134, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.3125 = fieldNorm(doc=3134)
    
  5. Böhm, A.; Seifert, C.; Schlötterer, J.; Granitzer, M.: Identifying tweets from the economic domain (2017) 3.01
    3.005452 = sum of:
      3.005452 = weight(author_txt:seifert in 3495) [ClassicSimilarity], result of:
        3.005452 = fieldWeight in 3495, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.617446 = idf(docFreq=7, maxDocs=44218)
          0.3125 = fieldNorm(doc=3495)
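
The author-match scores above can be reproduced from the numbers shown in the explanation trees. The following is a minimal sketch, assuming the tf and idf definitions of Lucene's ClassicSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of any library. The fieldNorm values (0.625 and 0.3125) are copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as defined by ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explanation trees."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from the listing: the term "seifert" occurs in 7 of 44218 records.
single_author = field_weight(1.0, 7, 44218, 0.625)   # entries 1 to 3
multi_author = field_weight(1.0, 7, 44218, 0.3125)   # entries 4 and 5

print(f"{single_author:.4f} {multi_author:.4f}")
```

The smaller fieldNorm for entries 4 and 5 is what halves their score: the author field of those records holds more names, so a single match carries less weight.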
    

Similar documents (content)

  1. Costers, L.: The electronic library and its organizational management (1994) 0.13
    0.13263327 = sum of:
      0.13263327 = product of:
        0.66316634 = sum of:
          0.16211875 = weight(abstract_txt:layered in 1214) [ClassicSimilarity], result of:
            0.16211875 = score(doc=1214,freq=1.0), product of:
              0.18007548 = queryWeight, product of:
                1.0850431 = boost
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.020162622 = queryNorm
              0.9002822 = fieldWeight in 1214, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.109375 = fieldNorm(doc=1214)
          0.074824184 = weight(abstract_txt:level in 1214) [ClassicSimilarity], result of:
            0.074824184 = score(doc=1214,freq=2.0), product of:
              0.10754587 = queryWeight, product of:
                1.1858549 = boost
                4.497956 = idf(docFreq=1337, maxDocs=44218)
                0.020162622 = queryNorm
              0.6957421 = fieldWeight in 1214, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.497956 = idf(docFreq=1337, maxDocs=44218)
                0.109375 = fieldNorm(doc=1214)
          0.04581865 = weight(abstract_txt:approach in 1214) [ClassicSimilarity], result of:
            0.04581865 = score(doc=1214,freq=1.0), product of:
              0.111849576 = queryWeight, product of:
                1.4811448 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.020162622 = queryNorm
              0.40964526 = fieldWeight in 1214, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.109375 = fieldNorm(doc=1214)
          0.12680988 = weight(abstract_txt:where in 1214) [ClassicSimilarity], result of:
            0.12680988 = score(doc=1214,freq=2.0), product of:
              0.17499739 = queryWeight, product of:
                1.8526616 = boost
                4.684772 = idf(docFreq=1109, maxDocs=44218)
                0.020162622 = queryNorm
              0.7246387 = fieldWeight in 1214, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.684772 = idf(docFreq=1109, maxDocs=44218)
                0.109375 = fieldNorm(doc=1214)
          0.25359488 = weight(abstract_txt:quality in 1214) [ClassicSimilarity], result of:
            0.25359488 = score(doc=1214,freq=3.0), product of:
              0.28770164 = queryWeight, product of:
                3.0667324 = boost
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.020162622 = queryNorm
              0.88145095 = fieldWeight in 1214, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.109375 = fieldNorm(doc=1214)
        0.2 = coord(5/25)
    
  2. Buchholz, K.: Criteria for the analysis of scientific quality (1995) 0.13
    0.13015947 = sum of:
      0.13015947 = product of:
        0.5423311 = sum of:
          0.090649255 = weight(abstract_txt:notably in 2450) [ClassicSimilarity], result of:
            0.090649255 = score(doc=2450,freq=1.0), product of:
              0.15295392 = queryWeight, product of:
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.020162622 = queryNorm
              0.59265727 = fieldWeight in 2450, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5860133 = idf(docFreq=60, maxDocs=44218)
                0.078125 = fieldNorm(doc=2450)
          0.03779192 = weight(abstract_txt:level in 2450) [ClassicSimilarity], result of:
            0.03779192 = score(doc=2450,freq=1.0), product of:
              0.10754587 = queryWeight, product of:
                1.1858549 = boost
                4.497956 = idf(docFreq=1337, maxDocs=44218)
                0.020162622 = queryNorm
              0.3514028 = fieldWeight in 2450, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.497956 = idf(docFreq=1337, maxDocs=44218)
                0.078125 = fieldNorm(doc=2450)
          0.11441316 = weight(abstract_txt:short in 2450) [ClassicSimilarity], result of:
            0.11441316 = score(doc=2450,freq=2.0), product of:
              0.17863578 = queryWeight, product of:
                1.5283363 = boost
                5.79699 = idf(docFreq=364, maxDocs=44218)
                0.020162622 = queryNorm
              0.6404829 = fieldWeight in 2450, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.79699 = idf(docFreq=364, maxDocs=44218)
                0.078125 = fieldNorm(doc=2450)
          0.09091977 = weight(abstract_txt:indicators in 2450) [ClassicSimilarity], result of:
            0.09091977 = score(doc=2450,freq=1.0), product of:
              0.19309306 = queryWeight, product of:
                1.5889785 = boost
                6.027006 = idf(docFreq=289, maxDocs=44218)
                0.020162622 = queryNorm
              0.47085986 = fieldWeight in 2450, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.027006 = idf(docFreq=289, maxDocs=44218)
                0.078125 = fieldNorm(doc=2450)
          0.060657468 = weight(abstract_txt:content in 2450) [ClassicSimilarity], result of:
            0.060657468 = score(doc=2450,freq=1.0), product of:
              0.18574934 = queryWeight, product of:
                2.2040088 = boost
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.020162622 = queryNorm
              0.3265555 = fieldWeight in 2450, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.078125 = fieldNorm(doc=2450)
          0.14789954 = weight(abstract_txt:quality in 2450) [ClassicSimilarity], result of:
            0.14789954 = score(doc=2450,freq=2.0), product of:
              0.28770164 = queryWeight, product of:
                3.0667324 = boost
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.020162622 = queryNorm
              0.51407266 = fieldWeight in 2450, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.6528544 = idf(docFreq=1145, maxDocs=44218)
                0.078125 = fieldNorm(doc=2450)
        0.24 = coord(6/25)
    
  3. Kenter, T.; Balog, K.; Rijke, M. de: Evaluating document filtering systems over time (2015) 0.09
    0.09247106 = sum of:
      0.09247106 = product of:
        0.46235532 = sum of:
          0.05632366 = weight(abstract_txt:document in 2672) [ClassicSimilarity], result of:
            0.05632366 = score(doc=2672,freq=6.0), product of:
              0.09795033 = queryWeight, product of:
                1.1317165 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.020162622 = queryNorm
              0.57502264 = fieldWeight in 2672, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2672)
          0.069343746 = weight(abstract_txt:precision in 2672) [ClassicSimilarity], result of:
            0.069343746 = score(doc=2672,freq=2.0), product of:
              0.16227712 = queryWeight, product of:
                1.4566772 = boost
                5.5251865 = idf(docFreq=478, maxDocs=44218)
                0.020162622 = queryNorm
              0.42731684 = fieldWeight in 2672, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5251865 = idf(docFreq=478, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2672)
          0.07811058 = weight(abstract_txt:recall in 2672) [ClassicSimilarity], result of:
            0.07811058 = score(doc=2672,freq=2.0), product of:
              0.17568135 = queryWeight, product of:
                1.5156451 = boost
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.020162622 = queryNorm
              0.44461513 = fieldWeight in 2672, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2672)
          0.05663163 = weight(abstract_txt:short in 2672) [ClassicSimilarity], result of:
            0.05663163 = score(doc=2672,freq=1.0), product of:
              0.17863578 = queryWeight, product of:
                1.5283363 = boost
                5.79699 = idf(docFreq=364, maxDocs=44218)
                0.020162622 = queryNorm
              0.3170229 = fieldWeight in 2672, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.79699 = idf(docFreq=364, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2672)
          0.20194569 = weight(abstract_txt:estimation in 2672) [ClassicSimilarity], result of:
            0.20194569 = score(doc=2672,freq=2.0), product of:
              0.33093458 = queryWeight, product of:
                2.0802033 = boost
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.020162622 = queryNorm
              0.6102284 = fieldWeight in 2672, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2672)
        0.2 = coord(5/25)
    
  4. Daranyi, S.; Zawiasa, R.; Hajnal, Z.: Conceptual mapping of a database in the humanities : first results of an experiment with Sophia (1996) 0.09
    0.090284206 = sum of:
      0.090284206 = product of:
        0.5642763 = sum of:
          0.1507084 = weight(abstract_txt:configurations in 4497) [ClassicSimilarity], result of:
            0.1507084 = score(doc=4497,freq=1.0), product of:
              0.17152368 = queryWeight, product of:
                1.0589653 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.020162622 = queryNorm
              0.87864494 = fieldWeight in 4497, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.109375 = fieldNorm(doc=4497)
          0.14863549 = weight(abstract_txt:texts in 4497) [ClassicSimilarity], result of:
            0.14863549 = score(doc=4497,freq=2.0), product of:
              0.16994724 = queryWeight, product of:
                1.4907051 = boost
                5.6542544 = idf(docFreq=420, maxDocs=44218)
                0.020162622 = queryNorm
              0.87459785 = fieldWeight in 4497, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6542544 = idf(docFreq=420, maxDocs=44218)
                0.109375 = fieldNorm(doc=4497)
          0.18001196 = weight(abstract_txt:indicators in 4497) [ClassicSimilarity], result of:
            0.18001196 = score(doc=4497,freq=2.0), product of:
              0.19309306 = queryWeight, product of:
                1.5889785 = boost
                6.027006 = idf(docFreq=289, maxDocs=44218)
                0.020162622 = queryNorm
              0.9322549 = fieldWeight in 4497, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.027006 = idf(docFreq=289, maxDocs=44218)
                0.109375 = fieldNorm(doc=4497)
          0.08492045 = weight(abstract_txt:content in 4497) [ClassicSimilarity], result of:
            0.08492045 = score(doc=4497,freq=1.0), product of:
              0.18574934 = queryWeight, product of:
                2.2040088 = boost
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.020162622 = queryNorm
              0.45717767 = fieldWeight in 4497, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.109375 = fieldNorm(doc=4497)
        0.16 = coord(4/25)
    
  5. Tang, J.; Liang, B.-Y.; Li, J.-Z.: Toward detecting mapping strategies for ontology interoperability (2005) 0.09
    0.08557715 = sum of:
      0.08557715 = product of:
        0.42788577 = sum of:
          0.12650341 = weight(abstract_txt:unseen in 3367) [ClassicSimilarity], result of:
            0.12650341 = score(doc=3367,freq=1.0), product of:
              0.22164568 = queryWeight, product of:
                1.2037861 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.020162622 = queryNorm
              0.5707461 = fieldWeight in 3367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.0625 = fieldNorm(doc=3367)
          0.05603821 = weight(abstract_txt:precision in 3367) [ClassicSimilarity], result of:
            0.05603821 = score(doc=3367,freq=1.0), product of:
              0.16227712 = queryWeight, product of:
                1.4566772 = boost
                5.5251865 = idf(docFreq=478, maxDocs=44218)
                0.020162622 = queryNorm
              0.34532416 = fieldWeight in 3367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5251865 = idf(docFreq=478, maxDocs=44218)
                0.0625 = fieldNorm(doc=3367)
          0.026182083 = weight(abstract_txt:approach in 3367) [ClassicSimilarity], result of:
            0.026182083 = score(doc=3367,freq=1.0), product of:
              0.111849576 = queryWeight, product of:
                1.4811448 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.020162622 = queryNorm
              0.234083 = fieldWeight in 3367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.0625 = fieldNorm(doc=3367)
          0.06312288 = weight(abstract_txt:recall in 3367) [ClassicSimilarity], result of:
            0.06312288 = score(doc=3367,freq=1.0), product of:
              0.17568135 = queryWeight, product of:
                1.5156451 = boost
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.020162622 = queryNorm
              0.35930327 = fieldWeight in 3367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7488523 = idf(docFreq=382, maxDocs=44218)
                0.0625 = fieldNorm(doc=3367)
          0.1560392 = weight(abstract_txt:multi in 3367) [ClassicSimilarity], result of:
            0.1560392 = score(doc=3367,freq=5.0), product of:
              0.18783085 = queryWeight, product of:
                1.5671774 = boost
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.020162622 = queryNorm
              0.8307432 = fieldWeight in 3367, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9443145 = idf(docFreq=314, maxDocs=44218)
                0.0625 = fieldNorm(doc=3367)
        0.2 = coord(5/25)
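
The content-based scores combine more factors than the author scores: each matching term contributes queryWeight * fieldWeight, and the sum is scaled by the coordination factor (matched terms / query terms). The sketch below recomputes the score of the first entry (doc 1214) under the same ClassicSimilarity assumptions as before. The boost, queryNorm, and fieldNorm values are copied from the listing; the function and variable names are hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.020162622  # copied from the listing
FIELD_NORM = 0.109375     # fieldNorm(doc=1214), copied from the listing

# (term, freq in document, docFreq, boost), all read off the tree for doc 1214
TERMS = [
    ("layered", 1.0, 31, 1.0850431),
    ("level", 2.0, 1337, 1.1858549),
    ("approach", 1.0, 2839, 1.4811448),
    ("where", 2.0, 1109, 1.8526616),
    ("quality", 3.0, 1145, 3.0667324),
]


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float, field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm: float, total_query_terms: int) -> float:
    """Sum of term scores, scaled by coord = matched terms / query terms."""
    matched = sum(term_score(f, df, b, field_norm) for _, f, df, b in terms)
    coord = len(terms) / total_query_terms
    return matched * coord


score = document_score(TERMS, FIELD_NORM, 25)
print(f"{score:.4f}")
```

Because only 5 of the 25 query terms match, the coordination factor of 0.2 reduces the raw sum by a factor of five, which is why all content-based scores in the list are far below the author-based ones.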