Document (#43722)

Author
Lowe, D.B.
Dollinger, I.
Koster, T.
Herbert, B.E.
Title
Text mining for type of research classification
Source
Cataloging and classification quarterly. 59(2021) no.8, p.815-834
Year
2021
Abstract
This project brought together undergraduate students in Computer Science with librarians to mine abstracts of articles from the Texas A&M University Libraries' institutional repository, OAKTrust, in order to probe the creation of new metadata to improve discovery and use. The mining operation task consisted simply of classifying the articles into two categories of research type: basic research ("for understanding," "curiosity-based," or "knowledge-based") and applied research ("use-based"). These categories are fundamental especially for funders but are also important to researchers. The mining-to-classification steps took several iterations, but ultimately, we achieved good results with the toolkit BERT (Bidirectional Encoder Representations from Transformers). The project and its workflows represent a preview of what may lie ahead in the future of crafting metadata using text mining techniques to enhance discoverability.
Content
Cf.: https://doi.org/10.1080/01639374.2021.1998281.
Footnote
Part of a special issue: Artificial intelligence (AI) and automated processes for subject access
Theme
Automatic indexing
Data Mining

Similar documents (author)

  1. Lowe, D.: Leverhulme Trust award to catalogue the archive of Stefan Heym (1995) 5.87
    5.871439 = sum of:
      5.871439 = weight(author_txt:lowe in 3688) [ClassicSimilarity], result of:
        5.871439 = fieldWeight in 3688, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.625 = fieldNorm(doc=3688)
    
  2. Steichen, B.; Lowe, R.: How do multilingual users search? : An investigation of query and result list language choices (2021) 4.70
    4.697151 = sum of:
      4.697151 = weight(author_txt:lowe in 246) [ClassicSimilarity], result of:
        4.697151 = fieldWeight in 246, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.5 = fieldNorm(doc=246)
    
  3. Bartolo, L.M.; Lowe, C.S.; Glotzer, S.C.: Information management of microstructures : non-print, multidisciplinary information in a materials science digital library (2004) 3.52
    3.5228634 = sum of:
      3.5228634 = weight(author_txt:lowe in 2669) [ClassicSimilarity], result of:
        3.5228634 = fieldWeight in 2669, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.375 = fieldNorm(doc=2669)
    
  4. Spitzer, K.L.; Eisenberg, M.B.; Lowe, C.A.: Information literacy : essential skills for the information age (1998) 3.52
    3.5228634 = sum of:
      3.5228634 = weight(author_txt:lowe in 3682) [ClassicSimilarity], result of:
        3.5228634 = fieldWeight in 3682, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.375 = fieldNorm(doc=3682)
    
  5. Spitzer, K.L.; Eisenberg, M.B.; Lowe, C.A.: Information literacy : essential skills for the information age (2004) 3.52
    3.5228634 = sum of:
      3.5228634 = weight(author_txt:lowe in 3686) [ClassicSimilarity], result of:
        3.5228634 = fieldWeight in 3686, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.394302 = idf(docFreq=9, maxDocs=44218)
          0.375 = fieldNorm(doc=3686)
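
Note on the author scores above: each one is a single fieldWeight, the product of tf, idf, and fieldNorm. The sketch below recomputes them, assuming the standard Lucene ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The fieldNorm values are copied from the listing rather than derived, since they come from a lossy length encoding stored in the index. The function names are illustrative and are not Lucene API calls.

```python
import math

def classic_tf(freq):
    # ClassicSimilarity term frequency: square root of the raw count.
    return math.sqrt(freq)

def classic_idf(doc_freq, max_docs):
    # ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain trees.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# fieldNorm values copied from the listing (docs 3688, 246, 2669).
field_norms = {3688: 0.625, 246: 0.5, 2669: 0.375}

# The term "lowe" occurs once per author field and in 9 of 44218 documents.
weights = {
    doc: field_weight(freq=1.0, doc_freq=9, max_docs=44218, field_norm=norm)
    for doc, norm in field_norms.items()
}

for doc, weight in weights.items():
    print(doc, round(weight, 4))
```

Because tf and idf are identical for all five hits, the ranking in this list depends only on fieldNorm, which is larger for shorter author fields.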
    

Similar documents (content)

  1. Chou, C.; Chu, T.: An analysis of BERT (NLP) for assisted subject indexing for Project Gutenberg (2022) 0.22
    0.21556632 = sum of:
      0.21556632 = product of:
        0.898193 = sum of:
          0.032906696 = weight(abstract_txt:classification in 1139) [ClassicSimilarity], result of:
            0.032906696 = score(doc=1139,freq=1.0), product of:
              0.08792538 = queryWeight, product of:
                1.0383112 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.021212311 = queryNorm
              0.37425706 = fieldWeight in 1139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.09375 = fieldNorm(doc=1139)
          0.043412913 = weight(abstract_txt:project in 1139) [ClassicSimilarity], result of:
            0.043412913 = score(doc=1139,freq=1.0), product of:
              0.105763875 = queryWeight, product of:
                1.1387781 = boost
                4.378348 = idf(docFreq=1507, maxDocs=44218)
                0.021212311 = queryNorm
              0.41047013 = fieldWeight in 1139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.378348 = idf(docFreq=1507, maxDocs=44218)
                0.09375 = fieldNorm(doc=1139)
          0.19694589 = weight(abstract_txt:bidirectional in 1139) [ClassicSimilarity], result of:
            0.19694589 = score(doc=1139,freq=1.0), product of:
              0.23004496 = queryWeight, product of:
                1.187577 = boost
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.021212311 = queryNorm
              0.85611916 = fieldWeight in 1139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.131938 = idf(docFreq=12, maxDocs=44218)
                0.09375 = fieldNorm(doc=1139)
          0.20795326 = weight(abstract_txt:encoder in 1139) [ClassicSimilarity], result of:
            0.20795326 = score(doc=1139,freq=1.0), product of:
              0.23853855 = queryWeight, product of:
                1.2093018 = boost
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.021212311 = queryNorm
              0.8717805 = fieldWeight in 1139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.298992 = idf(docFreq=10, maxDocs=44218)
                0.09375 = fieldNorm(doc=1139)
          0.38401064 = weight(abstract_txt:bert in 1139) [ClassicSimilarity], result of:
            0.38401064 = score(doc=1139,freq=3.0), product of:
              0.24894486 = queryWeight, product of:
                1.2353983 = boost
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.021212311 = queryNorm
              1.542553 = fieldWeight in 1139, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.499662 = idf(docFreq=8, maxDocs=44218)
                0.09375 = fieldNorm(doc=1139)
          0.03296361 = weight(abstract_txt:research in 1139) [ClassicSimilarity], result of:
            0.03296361 = score(doc=1139,freq=1.0), product of:
              0.110906735 = queryWeight, product of:
                1.649166 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.021212311 = queryNorm
              0.2972192 = fieldWeight in 1139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.09375 = fieldNorm(doc=1139)
        0.24 = coord(6/25)
    
  2. Perovsek, M.; Kranjc, J.; Erjavec, T.; Cestnik, B.; Lavrac, N.: TextFlows : a visual programming platform for text mining and natural language processing (2016) 0.14
    0.14405079 = sum of:
      0.14405079 = product of:
        0.72025394 = sum of:
          0.19597936 = weight(abstract_txt:workflows in 2697) [ClassicSimilarity], result of:
            0.19597936 = score(doc=2697,freq=4.0), product of:
              0.16311322 = queryWeight, product of:
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.021212311 = queryNorm
              1.2014928 = fieldWeight in 2697, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.078125 = fieldNorm(doc=2697)
          0.06981889 = weight(abstract_txt:text in 2697) [ClassicSimilarity], result of:
            0.06981889 = score(doc=2697,freq=6.0), product of:
              0.09022159 = queryWeight, product of:
                1.0517818 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.021212311 = queryNorm
              0.77386016 = fieldWeight in 2697, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=2697)
          0.020947082 = weight(abstract_txt:based in 2697) [ClassicSimilarity], result of:
            0.020947082 = score(doc=2697,freq=1.0), product of:
              0.084105626 = queryWeight, product of:
                1.2437371 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.021212311 = queryNorm
              0.24905685 = fieldWeight in 2697, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=2697)
          0.027469674 = weight(abstract_txt:research in 2697) [ClassicSimilarity], result of:
            0.027469674 = score(doc=2697,freq=1.0), product of:
              0.110906735 = queryWeight, product of:
                1.649166 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.021212311 = queryNorm
              0.24768265 = fieldWeight in 2697, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.078125 = fieldNorm(doc=2697)
          0.40603897 = weight(abstract_txt:mining in 2697) [ClassicSimilarity], result of:
            0.40603897 = score(doc=2697,freq=4.0), product of:
              0.42080486 = queryWeight, product of:
                3.2123716 = boost
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.021212311 = queryNorm
              0.9649104 = fieldWeight in 2697, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.078125 = fieldNorm(doc=2697)
        0.2 = coord(5/25)
    
  3. Short, M.: Text mining and subject analysis for fiction; or, using machine learning and information extraction to assign subject headings to dime novels (2019) 0.12
    0.12168318 = sum of:
      0.12168318 = product of:
        0.6084159 = sum of:
          0.11758762 = weight(abstract_txt:workflows in 5481) [ClassicSimilarity], result of:
            0.11758762 = score(doc=5481,freq=1.0), product of:
              0.16311322 = queryWeight, product of:
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.021212311 = queryNorm
              0.7208957 = fieldWeight in 5481, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.09375 = fieldNorm(doc=5481)
          0.032906696 = weight(abstract_txt:classification in 5481) [ClassicSimilarity], result of:
            0.032906696 = score(doc=5481,freq=1.0), product of:
              0.08792538 = queryWeight, product of:
                1.0383112 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.021212311 = queryNorm
              0.37425706 = fieldWeight in 5481, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.09375 = fieldNorm(doc=5481)
          0.034204133 = weight(abstract_txt:text in 5481) [ClassicSimilarity], result of:
            0.034204133 = score(doc=5481,freq=1.0), product of:
              0.09022159 = queryWeight, product of:
                1.0517818 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.021212311 = queryNorm
              0.37911248 = fieldWeight in 5481, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=5481)
          0.18009411 = weight(abstract_txt:discoverability in 5481) [ClassicSimilarity], result of:
            0.18009411 = score(doc=5481,freq=1.0), product of:
              0.2167277 = queryWeight, product of:
                1.1526903 = boost
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.021212311 = queryNorm
              0.83096945 = fieldWeight in 5481, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.863674 = idf(docFreq=16, maxDocs=44218)
                0.09375 = fieldNorm(doc=5481)
          0.24362339 = weight(abstract_txt:mining in 5481) [ClassicSimilarity], result of:
            0.24362339 = score(doc=5481,freq=1.0), product of:
              0.42080486 = queryWeight, product of:
                3.2123716 = boost
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.021212311 = queryNorm
              0.57894623 = fieldWeight in 5481, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.09375 = fieldNorm(doc=5481)
        0.2 = coord(5/25)
    
  4. Joo, S.; Choi, I.; Choi, N.: Topic analysis of the research domain in knowledge organization : a Latent Dirichlet Allocation approach (2018) 0.12
    0.11865269 = sum of:
      0.11865269 = product of:
        0.42375958 = sum of:
          0.021937797 = weight(abstract_txt:classification in 4304) [ClassicSimilarity], result of:
            0.021937797 = score(doc=4304,freq=1.0), product of:
              0.08792538 = queryWeight, product of:
                1.0383112 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.021212311 = queryNorm
              0.2495047 = fieldWeight in 4304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
          0.03949553 = weight(abstract_txt:text in 4304) [ClassicSimilarity], result of:
            0.03949553 = score(doc=4304,freq=3.0), product of:
              0.09022159 = queryWeight, product of:
                1.0517818 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.021212311 = queryNorm
              0.4377614 = fieldWeight in 4304, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
          0.016757665 = weight(abstract_txt:based in 4304) [ClassicSimilarity], result of:
            0.016757665 = score(doc=4304,freq=1.0), product of:
              0.084105626 = queryWeight, product of:
                1.2437371 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.021212311 = queryNorm
              0.19924548 = fieldWeight in 4304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
          0.037711013 = weight(abstract_txt:articles in 4304) [ClassicSimilarity], result of:
            0.037711013 = score(doc=4304,freq=1.0), product of:
              0.12617241 = queryWeight, product of:
                1.2438059 = boost
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.021212311 = queryNorm
              0.29888478 = fieldWeight in 4304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7821565 = idf(docFreq=1006, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
          0.04010414 = weight(abstract_txt:metadata in 4304) [ClassicSimilarity], result of:
            0.04010414 = score(doc=4304,freq=1.0), product of:
              0.13145539 = queryWeight, product of:
                1.2695787 = boost
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.021212311 = queryNorm
              0.30507794 = fieldWeight in 4304, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
          0.038063094 = weight(abstract_txt:research in 4304) [ClassicSimilarity], result of:
            0.038063094 = score(doc=4304,freq=3.0), product of:
              0.110906735 = queryWeight, product of:
                1.649166 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.021212311 = queryNorm
              0.34319913 = fieldWeight in 4304, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
          0.22969033 = weight(abstract_txt:mining in 4304) [ClassicSimilarity], result of:
            0.22969033 = score(doc=4304,freq=2.0), product of:
              0.42080486 = queryWeight, product of:
                3.2123716 = boost
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.021212311 = queryNorm
              0.54583573 = fieldWeight in 4304, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.0625 = fieldNorm(doc=4304)
        0.28 = coord(7/25)
    
  5. Altinel, B.; Ganiz, M.C.: Semantic text classification : a survey of past and recent advances (2018) 0.11
    0.11404083 = sum of:
      0.11404083 = product of:
        0.40728867 = sum of:
          0.06921062 = weight(abstract_txt:classification in 5051) [ClassicSimilarity], result of:
            0.06921062 = score(doc=5051,freq=13.0), product of:
              0.08792538 = queryWeight, product of:
                1.0383112 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.021212311 = queryNorm
              0.78715175 = fieldWeight in 5051, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.07193944 = weight(abstract_txt:text in 5051) [ClassicSimilarity], result of:
            0.07193944 = score(doc=5051,freq=13.0), product of:
              0.09022159 = queryWeight, product of:
                1.0517818 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.021212311 = queryNorm
              0.7973639 = fieldWeight in 5051, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.025396988 = weight(abstract_txt:based in 5051) [ClassicSimilarity], result of:
            0.025396988 = score(doc=5051,freq=3.0), product of:
              0.084105626 = queryWeight, product of:
                1.2437371 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.021212311 = queryNorm
              0.3019654 = fieldWeight in 5051, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.037517365 = weight(abstract_txt:type in 5051) [ClassicSimilarity], result of:
            0.037517365 = score(doc=5051,freq=1.0), product of:
              0.13744695 = queryWeight, product of:
                1.2981892 = boost
                4.991248 = idf(docFreq=816, maxDocs=44218)
                0.021212311 = queryNorm
              0.27295887 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.991248 = idf(docFreq=816, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.041881826 = weight(abstract_txt:categories in 5051) [ClassicSimilarity], result of:
            0.041881826 = score(doc=5051,freq=1.0), product of:
              0.14790992 = queryWeight, product of:
                1.3466945 = boost
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.021212311 = queryNorm
              0.28315765 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.17774 = idf(docFreq=677, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.019228771 = weight(abstract_txt:research in 5051) [ClassicSimilarity], result of:
            0.019228771 = score(doc=5051,freq=1.0), product of:
              0.110906735 = queryWeight, product of:
                1.649166 = boost
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.021212311 = queryNorm
              0.17337786 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.170338 = idf(docFreq=5046, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
          0.14211364 = weight(abstract_txt:mining in 5051) [ClassicSimilarity], result of:
            0.14211364 = score(doc=5051,freq=1.0), product of:
              0.42080486 = queryWeight, product of:
                3.2123716 = boost
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.021212311 = queryNorm
              0.33771864 = fieldWeight in 5051, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1754265 = idf(docFreq=249, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5051)
        0.28 = coord(7/25)
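
Note on the content scores above: each matched term contributes queryWeight * fieldWeight, and the sum is scaled by coord, the fraction of the 25 query terms that matched. The sketch below recomputes the total for the first hit (doc 1139), assuming the standard ClassicSimilarity formulas. The boosts, document frequencies, queryNorm, and fieldNorm are copied from the listing, and the helper names are illustrative, not Lucene API calls.

```python
import math

MAX_DOCS = 44218          # maxDocs from the listing
QUERY_NORM = 0.021212311  # queryNorm from the listing
QUERY_TERMS = 25          # denominator of coord(n/25)

def idf(doc_freq, max_docs=MAX_DOCS):
    # ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, field_norm, boost=1.0):
    # score = queryWeight * fieldWeight for one matched term.
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

# Matched terms for doc 1139: (term, boost, docFreq, freq).
matched = [
    ("classification", 1.0383112, 2218, 1.0),
    ("project",        1.1387781, 1507, 1.0),
    ("bidirectional",  1.187577,    12, 1.0),
    ("encoder",        1.2093018,   10, 1.0),
    ("bert",           1.2353983,    8, 3.0),
    ("research",       1.649166,  5046, 1.0),
]
FIELD_NORM = 0.09375  # fieldNorm(doc=1139)

raw_sum = sum(
    term_score(freq, doc_freq, FIELD_NORM, boost)
    for _, boost, doc_freq, freq in matched
)
coord = len(matched) / QUERY_TERMS  # 6/25 = 0.24
total = raw_sum * coord

print(round(raw_sum, 4), round(total, 4))
```

Since idf enters both queryWeight and fieldWeight, rare terms such as "bert" (docFreq=8) dominate the sum, which helps explain why the first hit outranks documents that match more query terms.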