Du, C.; Cohoon, J.; Lopez, P.; Howison, J.: Softcite dataset : a dataset of software mentions in biomedical and economic research publications (2021) 0.00
```
0.0016616598 = product of:
  0.0033233196 = sum of:
    0.0033233196 = product of:
      0.006646639 = sum of:
        0.006646639 = weight(_text_:a in 262) [ClassicSimilarity], result of:
          0.006646639 = score(doc=262,freq=8.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.15287387 = fieldWeight in 262, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=262)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
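The score breakdown above follows the TF-IDF formulas of Lucene's ClassicSimilarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The following minimal sketch recomputes the figures from the values printed in the explanation. The function and variable names are illustrative assumptions, not part of any search engine API, and `query_norm` and `field_norm` are copied from the output rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as defined by ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term frequency factor as defined by ClassicSimilarity."""
    return math.sqrt(freq)


# Inputs copied from the explain output for the term "a" in doc 262.
freq = 8.0
doc_freq = 37942
max_docs = 44218
query_norm = 0.037706986   # taken as given, not derived here
field_norm = 0.046875      # taken as given, not derived here
coord_inner = 0.5          # coord(1/2), inner clause
coord_outer = 0.5          # coord(1/2), outer clause

idf = classic_idf(doc_freq, max_docs)
tf = classic_tf(freq)

query_weight = idf * query_norm        # queryWeight
field_weight = tf * idf * field_norm   # fieldWeight
term_score = query_weight * field_weight
final_score = term_score * coord_inner * coord_outer

print(f"idf          = {idf:.6f}")
print(f"tf           = {tf:.6f}")
print(f"queryWeight  = {query_weight:.9f}")
print(f"fieldWeight  = {field_weight:.8f}")
print(f"term score   = {term_score:.9f}")
print(f"final score  = {final_score:.10f}")
```

The high document frequency (37,942 of 44,218 documents) gives the term "a" an idf close to its minimum of 1, which together with the two coord(1/2) factors explains why the displayed relevance rounds to 0.00.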
Abstract

Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF-format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in the hope of encouraging more discussion about creating datasets for machine learning use.

Type

a