Document (#43264)

Author
Du, C.
Cohoon, J.
Lopez, P.
Howison, J.
Title
Softcite dataset : a dataset of software mentions in biomedical and economic research publications
Source
Journal of the Association for Information Science and Technology. 72(2021) no.7, p.870-884
Year
2021
Abstract
Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.
Content
Cf.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24454.
Form
Software

Similar documents (author)

  1. Lopez, C.G.: Technical processes and the technological development of the library system in the National Autonomous University of Mexico (2000) 5.27
    5.274244 = sum of:
      5.274244 = weight(author_txt:lopez in 5371) [ClassicSimilarity], result of:
        5.274244 = fieldWeight in 5371, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.43879 = idf(docFreq=25, maxDocs=44218)
          0.625 = fieldNorm(doc=5371)
    
  2. Lopez, P.: Artificial Intelligence und die normative Kraft des Faktischen (2021) 5.27
    5.274244 = sum of:
      5.274244 = weight(author_txt:lopez in 1025) [ClassicSimilarity], result of:
        5.274244 = fieldWeight in 1025, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.43879 = idf(docFreq=25, maxDocs=44218)
          0.625 = fieldNorm(doc=1025)
    
  3. Lopez, P.: ChatGPT und der Unterschied zwischen Form und Inhalt (2023) 5.27
    5.274244 = sum of:
      5.274244 = weight(author_txt:lopez in 1027) [ClassicSimilarity], result of:
        5.274244 = fieldWeight in 1027, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.43879 = idf(docFreq=25, maxDocs=44218)
          0.625 = fieldNorm(doc=1027)
    
  4. Cozar, E.D. Lopez- -> Lopez-Cozar, E.D.: 4.48
    4.4753447 = sum of:
      4.4753447 = weight(author_txt:lopez in 1188) [ClassicSimilarity], result of:
        4.4753447 = fieldWeight in 1188, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.43879 = idf(docFreq=25, maxDocs=44218)
          0.375 = fieldNorm(doc=1188)
    
  5. Pujalte, C. Lopez- -> Lopez-Pujalte, C.: 4.48
    4.4753447 = sum of:
      4.4753447 = weight(author_txt:lopez in 2746) [ClassicSimilarity], result of:
        4.4753447 = fieldWeight in 2746, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.43879 = idf(docFreq=25, maxDocs=44218)
          0.375 = fieldNorm(doc=2746)
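
The author scores above have the shape of a Lucene ClassicSimilarity (TF-IDF) field weight: tf times idf times fieldNorm. The following is a minimal sketch of that arithmetic, assuming the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function and constant names are illustrative and not part of the catalog software; the input numbers are copied from the listing.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming the ClassicSimilarity formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in the explain trees."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


MAX_DOCS = 44218       # maxDocs from the listing
LOPEZ_DOC_FREQ = 25    # docFreq of "lopez" in author_txt

# Entries 1-3: one occurrence of "lopez", fieldNorm 0.625
single = field_weight(1.0, LOPEZ_DOC_FREQ, MAX_DOCS, 0.625)
# Entries 4-5: two occurrences, fieldNorm 0.375
double = field_weight(2.0, LOPEZ_DOC_FREQ, MAX_DOCS, 0.375)

print(f"single match: {single:.4f}, double match: {double:.4f}")
```

Under these assumptions the two results should agree with the listed 5.274244 and 4.4753447 up to single-precision rounding. The smaller fieldNorm of the hyphenated entries outweighs their higher term frequency, which is why they rank lower.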
    

Similar documents (content)

  1. Ahmed, M.: Automatic indexing for agriculture : designing a framework by deploying Agrovoc, Agris and Annif (2023) 0.18
    0.1810056 = sum of:
      0.1810056 = product of:
        0.905028 = sum of:
          0.042514734 = weight(abstract_txt:learned in 1024) [ClassicSimilarity], result of:
            0.042514734 = score(doc=1024,freq=1.0), product of:
              0.103584126 = queryWeight, product of:
                1.0499673 = boost
                6.5669885 = idf(docFreq=168, maxDocs=44218)
                0.015022811 = queryNorm
              0.41043678 = fieldWeight in 1024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5669885 = idf(docFreq=168, maxDocs=44218)
                0.0625 = fieldNorm(doc=1024)
          0.062393993 = weight(abstract_txt:supervised in 1024) [ClassicSimilarity], result of:
            0.062393993 = score(doc=1024,freq=1.0), product of:
              0.13377103 = queryWeight, product of:
                1.1931916 = boost
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.015022811 = queryNorm
              0.4664238 = fieldWeight in 1024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.462781 = idf(docFreq=68, maxDocs=44218)
                0.0625 = fieldNorm(doc=1024)
          0.05576371 = weight(abstract_txt:learning in 1024) [ClassicSimilarity], result of:
            0.05576371 = score(doc=1024,freq=3.0), product of:
              0.10842704 = queryWeight, product of:
                1.519193 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.015022811 = queryNorm
              0.51429707 = fieldWeight in 1024, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0625 = fieldNorm(doc=1024)
          0.014350885 = weight(abstract_txt:research in 1024) [ClassicSimilarity], result of:
            0.014350885 = score(doc=1024,freq=1.0), product of:
              0.07242577 = queryWeight, product of:
                1.5206748 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.015022811 = queryNorm
              0.19814612 = fieldWeight in 1024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=1024)
          0.73000467 = weight(abstract_txt:dataset in 1024) [ClassicSimilarity], result of:
            0.73000467 = score(doc=1024,freq=7.0), product of:
              0.65489006 = queryWeight, product of:
                6.4667935 = boost
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.015022811 = queryNorm
              1.114698 = fieldWeight in 1024, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.0625 = fieldNorm(doc=1024)
        0.2 = coord(5/25)
    
  2. The Computer Science Ontology (CSO) (2018) 0.15
    0.14996752 = sum of:
      0.14996752 = product of:
        0.5355983 = sum of:
          0.032195196 = weight(abstract_txt:learning in 4429) [ClassicSimilarity], result of:
            0.032195196 = score(doc=4429,freq=1.0), product of:
              0.10842704 = queryWeight, product of:
                1.519193 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.015022811 = queryNorm
              0.29692957 = fieldWeight in 4429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
          0.02870177 = weight(abstract_txt:research in 4429) [ClassicSimilarity], result of:
            0.02870177 = score(doc=4429,freq=4.0), product of:
              0.07242577 = queryWeight, product of:
                1.5206748 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.015022811 = queryNorm
              0.39629224 = fieldWeight in 4429, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
          0.040876117 = weight(abstract_txt:discussion in 4429) [ClassicSimilarity], result of:
            0.040876117 = score(doc=4429,freq=1.0), product of:
              0.12713252 = queryWeight, product of:
                1.645025 = boost
                5.144379 = idf(docFreq=700, maxDocs=44218)
                0.015022811 = queryNorm
              0.3215237 = fieldWeight in 4429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.144379 = idf(docFreq=700, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
          0.062103555 = weight(abstract_txt:publications in 4429) [ClassicSimilarity], result of:
            0.062103555 = score(doc=4429,freq=2.0), product of:
              0.13335559 = queryWeight, product of:
                1.6848055 = boost
                5.268782 = idf(docFreq=618, maxDocs=44218)
                0.015022811 = queryNorm
              0.46569893 = fieldWeight in 4429, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.268782 = idf(docFreq=618, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
          0.0461718 = weight(abstract_txt:academic in 4429) [ClassicSimilarity], result of:
            0.0461718 = score(doc=4429,freq=1.0), product of:
              0.15784295 = queryWeight, product of:
                2.244928 = boost
                4.6802773 = idf(docFreq=1114, maxDocs=44218)
                0.015022811 = queryNorm
              0.29251733 = fieldWeight in 4429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6802773 = idf(docFreq=1114, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
          0.049634054 = weight(abstract_txt:software in 4429) [ClassicSimilarity], result of:
            0.049634054 = score(doc=4429,freq=1.0), product of:
              0.18230842 = queryWeight, product of:
                2.7858808 = boost
                4.3560514 = idf(docFreq=1541, maxDocs=44218)
                0.015022811 = queryNorm
              0.27225322 = fieldWeight in 4429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3560514 = idf(docFreq=1541, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
          0.27591583 = weight(abstract_txt:dataset in 4429) [ClassicSimilarity], result of:
            0.27591583 = score(doc=4429,freq=1.0), product of:
              0.65489006 = queryWeight, product of:
                6.4667935 = boost
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.015022811 = queryNorm
              0.42131627 = fieldWeight in 4429, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.0625 = fieldNorm(doc=4429)
        0.28 = coord(7/25)
    
  3. Yu, M.; Sun, A.: Dataset versus reality : understanding model performance from the perspective of information need (2023) 0.14
    0.14413303 = sum of:
      0.14413303 = product of:
        0.72066516 = sum of:
          0.09161845 = weight(abstract_txt:datasets in 1073) [ClassicSimilarity], result of:
            0.09161845 = score(doc=1073,freq=6.0), product of:
              0.10396003 = queryWeight, product of:
                1.0518707 = boost
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.015022811 = queryNorm
              0.8812853 = fieldWeight in 1073, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1073)
          0.028170794 = weight(abstract_txt:learning in 1073) [ClassicSimilarity], result of:
            0.028170794 = score(doc=1073,freq=1.0), product of:
              0.10842704 = queryWeight, product of:
                1.519193 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.015022811 = queryNorm
              0.25981337 = fieldWeight in 1073, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1073)
          0.025114048 = weight(abstract_txt:research in 1073) [ClassicSimilarity], result of:
            0.025114048 = score(doc=1073,freq=4.0), product of:
              0.07242577 = queryWeight, product of:
                1.5206748 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.015022811 = queryNorm
              0.3467557 = fieldWeight in 1073, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1073)
          0.035916116 = weight(abstract_txt:creation in 1073) [ClassicSimilarity], result of:
            0.035916116 = score(doc=1073,freq=1.0), product of:
              0.12748657 = queryWeight, product of:
                1.647314 = boost
                5.1515374 = idf(docFreq=695, maxDocs=44218)
                0.015022811 = queryNorm
              0.2817247 = fieldWeight in 1073, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1515374 = idf(docFreq=695, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1073)
          0.53984576 = weight(abstract_txt:dataset in 1073) [ClassicSimilarity], result of:
            0.53984576 = score(doc=1073,freq=5.0), product of:
              0.65489006 = queryWeight, product of:
                6.4667935 = boost
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.015022811 = queryNorm
              0.82433033 = fieldWeight in 1073, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1073)
        0.2 = coord(5/25)
    
  4. Jiao, H.; Qiu, Y.; Ma, X.; Yang, B.: Dissemination effect of data papers on scientific datasets (2024) 0.12
    0.12089374 = sum of:
      0.12089374 = product of:
        0.6044687 = sum of:
          0.09558379 = weight(abstract_txt:datasets in 1204) [ClassicSimilarity], result of:
            0.09558379 = score(doc=1204,freq=5.0), product of:
              0.10396003 = queryWeight, product of:
                1.0518707 = boost
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.015022811 = queryNorm
              0.9194283 = fieldWeight in 1204, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.0625 = fieldNorm(doc=1204)
          0.054471985 = weight(abstract_txt:biomedical in 1204) [ClassicSimilarity], result of:
            0.054471985 = score(doc=1204,freq=1.0), product of:
              0.12219376 = queryWeight, product of:
                1.1403908 = boost
                7.132539 = idf(docFreq=95, maxDocs=44218)
                0.015022811 = queryNorm
              0.44578367 = fieldWeight in 1204, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.132539 = idf(docFreq=95, maxDocs=44218)
                0.0625 = fieldNorm(doc=1204)
          0.020295216 = weight(abstract_txt:research in 1204) [ClassicSimilarity], result of:
            0.020295216 = score(doc=1204,freq=2.0), product of:
              0.07242577 = queryWeight, product of:
                1.5206748 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.015022811 = queryNorm
              0.28022093 = fieldWeight in 1204, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=1204)
          0.043913845 = weight(abstract_txt:publications in 1204) [ClassicSimilarity], result of:
            0.043913845 = score(doc=1204,freq=1.0), product of:
              0.13335559 = queryWeight, product of:
                1.6848055 = boost
                5.268782 = idf(docFreq=618, maxDocs=44218)
                0.015022811 = queryNorm
              0.32929888 = fieldWeight in 1204, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.268782 = idf(docFreq=618, maxDocs=44218)
                0.0625 = fieldNorm(doc=1204)
          0.3902039 = weight(abstract_txt:dataset in 1204) [ClassicSimilarity], result of:
            0.3902039 = score(doc=1204,freq=2.0), product of:
              0.65489006 = queryWeight, product of:
                6.4667935 = boost
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.015022811 = queryNorm
              0.59583116 = fieldWeight in 1204, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.0625 = fieldNorm(doc=1204)
        0.2 = coord(5/25)
    
  5. Mai, F.; Galke, L.; Scherp, A.: Using deep learning for title-based semantic subject indexing to reach competitive performance to full-text (2018) 0.12
    0.11635754 = sum of:
      0.11635754 = product of:
        0.5817877 = sum of:
          0.052895933 = weight(abstract_txt:datasets in 4093) [ClassicSimilarity], result of:
            0.052895933 = score(doc=4093,freq=2.0), product of:
              0.10396003 = queryWeight, product of:
                1.0518707 = boost
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.015022811 = queryNorm
              0.5088103 = fieldWeight in 4093, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4093)
          0.04413364 = weight(abstract_txt:economics in 4093) [ClassicSimilarity], result of:
            0.04413364 = score(doc=4093,freq=1.0), product of:
              0.11608462 = queryWeight, product of:
                1.111518 = boost
                6.9519553 = idf(docFreq=114, maxDocs=44218)
                0.015022811 = queryNorm
              0.38018507 = fieldWeight in 4093, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9519553 = idf(docFreq=114, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4093)
          0.028170794 = weight(abstract_txt:learning in 4093) [ClassicSimilarity], result of:
            0.028170794 = score(doc=4093,freq=1.0), product of:
              0.10842704 = queryWeight, product of:
                1.519193 = boost
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.015022811 = queryNorm
              0.25981337 = fieldWeight in 4093, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.750873 = idf(docFreq=1038, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4093)
          0.038424615 = weight(abstract_txt:publications in 4093) [ClassicSimilarity], result of:
            0.038424615 = score(doc=4093,freq=1.0), product of:
              0.13335559 = queryWeight, product of:
                1.6848055 = boost
                5.268782 = idf(docFreq=618, maxDocs=44218)
                0.015022811 = queryNorm
              0.2881365 = fieldWeight in 4093, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.268782 = idf(docFreq=618, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4093)
          0.4181627 = weight(abstract_txt:dataset in 4093) [ClassicSimilarity], result of:
            0.4181627 = score(doc=4093,freq=3.0), product of:
              0.65489006 = queryWeight, product of:
                6.4667935 = boost
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.015022811 = queryNorm
              0.6385235 = fieldWeight in 4093, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.7410603 = idf(docFreq=141, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4093)
        0.2 = coord(5/25)
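
The content scores above combine several abstract terms: each term contributes queryWeight times fieldWeight, the contributions are summed, and the sum is scaled by the coord factor (matched query terms over total query terms). The sketch below reproduces that arithmetic for the first entry (doc 1024), assuming the ClassicSimilarity formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). All names are illustrative; the boosts, frequencies, and norms are copied from the listing.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.015022811   # queryNorm from the listing
FIELD_NORM = 0.0625        # fieldNorm(doc=1024)
COORD = 5 / 25             # 5 of 25 query terms matched

# (term, boost, docFreq, freq in doc 1024), copied from the explain tree
TERMS = [
    ("learned",    1.0499673, 168,  1.0),
    ("supervised", 1.1931916, 68,   1.0),
    ("learning",   1.519193,  1038, 3.0),
    ("research",   1.5206748, 5046, 1.0),
    ("dataset",    6.4667935, 141,  7.0),
]


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency, assuming the ClassicSimilarity formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, doc_freq: int, freq: float) -> float:
    """One term's contribution: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


contributions = {name: term_score(b, df, f) for name, b, df, f in TERMS}
total = COORD * sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:<11} {value:.6f}")
print(f"total       {total:.6f}")
```

If the assumptions hold, the total should land close to the listed 0.1810056. Nearly all of it comes from "dataset", which has both the largest boost and the highest in-document frequency.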