Document (#35173)

Author
Wang, J.
Title
An extensive study on automated Dewey Decimal Classification
Source
Journal of the American Society for Information Science and Technology. 60(2009) no. 11, pp. 2269-2286
Year
2009
Abstract
In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first statistically analyze the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
Theme
Automatic classification
Object
DDC

Similar documents (author)

  1. Wang, H.; Wang, C.: Ontologies for universal information systems (1995) 4.64
    4.63939 = sum of:
      4.63939 = weight(author_txt:wang in 3194) [ClassicSimilarity], result of:
        4.63939 = fieldWeight in 3194, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          6.5610886 = idf(docFreq=169, maxDocs=44218)
          0.5 = fieldNorm(doc=3194)
    
  2. Wang, F.; Wang, X.: Tracing theory diffusion : a text mining and citation-based analysis of TAM (2020) 4.64
    4.63939 = sum of:
      4.63939 = weight(author_txt:wang in 5980) [ClassicSimilarity], result of:
        4.63939 = fieldWeight in 5980, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          6.5610886 = idf(docFreq=169, maxDocs=44218)
          0.5 = fieldNorm(doc=5980)
    
  3. Wang, C.: The online catalogue, subject access and user reactions : a review (1985) 4.10
    4.1006804 = sum of:
      4.1006804 = weight(author_txt:wang in 986) [ClassicSimilarity], result of:
        4.1006804 = fieldWeight in 986, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.5610886 = idf(docFreq=169, maxDocs=44218)
          0.625 = fieldNorm(doc=986)
    
  4. Wang, C.: Bibliometrics : a textbook (1990) 4.10
    4.1006804 = sum of:
      4.1006804 = weight(author_txt:wang in 5040) [ClassicSimilarity], result of:
        4.1006804 = fieldWeight in 5040, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.5610886 = idf(docFreq=169, maxDocs=44218)
          0.625 = fieldNorm(doc=5040)
    
  5. Wang, P.: Users' information needs at different stages of a research project : a cognitive view (1997) 4.10
    4.1006804 = sum of:
      4.1006804 = weight(author_txt:wang in 320) [ClassicSimilarity], result of:
        4.1006804 = fieldWeight in 320, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.5610886 = idf(docFreq=169, maxDocs=44218)
          0.625 = fieldNorm(doc=320)
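
The author scores above can be checked by hand. Each one is a field weight, the product of tf, idf, and fieldNorm, where tf is the square root of the term frequency. The idf values in the listing are consistent with 1 + ln(maxDocs / (docFreq + 1)). The following is a minimal Python sketch of that arithmetic, not Lucene code: the function names are illustrative assumptions, and results agree with the listing only up to floating-point rounding.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, matching the idf values in the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as in each explain tree above."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # Entry 1: "wang" occurs twice in the author field (listing: 4.63939).
    print(field_weight(2.0, 169, 44218, 0.5))
    # Entry 3: "wang" occurs once, with a larger fieldNorm (listing: 4.1006804).
    print(field_weight(1.0, 169, 44218, 0.625))
```

Entries 1 and 2 outrank entries 3 to 5 because "wang" matches twice in their author fields. The tf factor of about 1.414 more than offsets their smaller fieldNorm (0.5 against 0.625).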
    

Similar documents (content)

  1. Riesthuis, G.J.A.: Fiction in need of transcending traditional classification (1997) 0.16
    0.15613233 = sum of:
      0.15613233 = product of:
        0.9758271 = sum of:
          0.047111254 = weight(abstract_txt:library in 1808) [ClassicSimilarity], result of:
            0.047111254 = score(doc=1808,freq=1.0), product of:
              0.078846 = queryWeight, product of:
                1.2810447 = boost
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.019313974 = queryNorm
              0.59750974 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.1875 = fieldNorm(doc=1808)
          0.19463195 = weight(abstract_txt:dewey in 1808) [ClassicSimilarity], result of:
            0.19463195 = score(doc=1808,freq=1.0), product of:
              0.1773409 = queryWeight, product of:
                1.5686761 = boost
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.019313974 = queryNorm
              1.0975018 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.1875 = fieldNorm(doc=1808)
          0.30630544 = weight(abstract_txt:decimal in 1808) [ClassicSimilarity], result of:
            0.30630544 = score(doc=1808,freq=2.0), product of:
              0.19044048 = queryWeight, product of:
                1.6255804 = boost
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.019313974 = queryNorm
              1.6084051 = fieldWeight in 1808, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.1875 = fieldNorm(doc=1808)
          0.42777848 = weight(abstract_txt:classification in 1808) [ClassicSimilarity], result of:
            0.42777848 = score(doc=1808,freq=3.0), product of:
              0.32995775 = queryWeight, product of:
                4.2794504 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.019313974 = queryNorm
              1.2964644 = fieldWeight in 1808, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.1875 = fieldNorm(doc=1808)
        0.16 = coord(4/25)
    
  2. Rafferty, P.: The representation of knowledge in library classification schemes (2001) 0.16
    0.15563178 = sum of:
      0.15563178 = product of:
        0.5558278 = sum of:
          0.023896093 = weight(abstract_txt:within in 640) [ClassicSimilarity], result of:
            0.023896093 = score(doc=640,freq=1.0), product of:
              0.091123745 = queryWeight, product of:
                1.1244615 = boost
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.019313974 = queryNorm
              0.26223782 = fieldWeight in 640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
          0.024737507 = weight(abstract_txt:over in 640) [ClassicSimilarity], result of:
            0.024737507 = score(doc=640,freq=1.0), product of:
              0.093250446 = queryWeight, product of:
                1.1375076 = boost
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.019313974 = queryNorm
              0.2652803 = fieldWeight in 640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.244485 = idf(docFreq=1723, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
          0.01570375 = weight(abstract_txt:library in 640) [ClassicSimilarity], result of:
            0.01570375 = score(doc=640,freq=1.0), product of:
              0.078846 = queryWeight, product of:
                1.2810447 = boost
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.019313974 = queryNorm
              0.19916992 = fieldWeight in 640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
          0.06487732 = weight(abstract_txt:dewey in 640) [ClassicSimilarity], result of:
            0.06487732 = score(doc=640,freq=1.0), product of:
              0.1773409 = queryWeight, product of:
                1.5686761 = boost
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.019313974 = queryNorm
              0.36583394 = fieldWeight in 640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
          0.07219688 = weight(abstract_txt:decimal in 640) [ClassicSimilarity], result of:
            0.07219688 = score(doc=640,freq=1.0), product of:
              0.19044048 = queryWeight, product of:
                1.6255804 = boost
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.019313974 = queryNorm
              0.3791047 = fieldWeight in 640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
          0.06923058 = weight(abstract_txt:bibliographic in 640) [ClassicSimilarity], result of:
            0.06923058 = score(doc=640,freq=2.0), product of:
              0.18518777 = queryWeight, product of:
                2.2669919 = boost
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.019313974 = queryNorm
              0.3738399 = fieldWeight in 640, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
          0.28518566 = weight(abstract_txt:classification in 640) [ClassicSimilarity], result of:
            0.28518566 = score(doc=640,freq=12.0), product of:
              0.32995775 = queryWeight, product of:
                4.2794504 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.019313974 = queryNorm
              0.8643096 = fieldWeight in 640, product of:
                3.4641016 = tf(freq=12.0), with freq of:
                  12.0 = termFreq=12.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.0625 = fieldNorm(doc=640)
        0.28 = coord(7/25)
    
  3. Mitchell, J.S.: DDC21 and beyond : the Dewey Decimal Classification prepares for the future (1995) 0.15
    0.15065104 = sum of:
      0.15065104 = product of:
        0.7532552 = sum of:
          0.085062385 = weight(abstract_txt:distribution in 5564) [ClassicSimilarity], result of:
            0.085062385 = score(doc=5564,freq=1.0), product of:
              0.16212295 = queryWeight, product of:
                1.4998612 = boost
                5.596568 = idf(docFreq=445, maxDocs=44218)
                0.019313974 = queryNorm
              0.52467823 = fieldWeight in 5564, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.596568 = idf(docFreq=445, maxDocs=44218)
                0.09375 = fieldNorm(doc=5564)
          0.19463195 = weight(abstract_txt:dewey in 5564) [ClassicSimilarity], result of:
            0.19463195 = score(doc=5564,freq=4.0), product of:
              0.1773409 = queryWeight, product of:
                1.5686761 = boost
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.019313974 = queryNorm
              1.0975018 = fieldWeight in 5564, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.09375 = fieldNorm(doc=5564)
          0.15315272 = weight(abstract_txt:decimal in 5564) [ClassicSimilarity], result of:
            0.15315272 = score(doc=5564,freq=2.0), product of:
              0.19044048 = queryWeight, product of:
                1.6255804 = boost
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.019313974 = queryNorm
              0.80420256 = fieldWeight in 5564, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.09375 = fieldNorm(doc=5564)
          0.07343012 = weight(abstract_txt:bibliographic in 5564) [ClassicSimilarity], result of:
            0.07343012 = score(doc=5564,freq=1.0), product of:
              0.18518777 = queryWeight, product of:
                2.2669919 = boost
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.019313974 = queryNorm
              0.39651713 = fieldWeight in 5564, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.229516 = idf(docFreq=1749, maxDocs=44218)
                0.09375 = fieldNorm(doc=5564)
          0.24697803 = weight(abstract_txt:classification in 5564) [ClassicSimilarity], result of:
            0.24697803 = score(doc=5564,freq=4.0), product of:
              0.32995775 = queryWeight, product of:
                4.2794504 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.019313974 = queryNorm
              0.7485141 = fieldWeight in 5564, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.09375 = fieldNorm(doc=5564)
        0.2 = coord(5/25)
    
  4. Olson, H.A.: The ubiquitous hierarchy : an army to overcome the threat of a mob (2004) 0.14
    0.14389405 = sum of:
      0.14389405 = product of:
        0.7194702 = sum of:
          0.027481565 = weight(abstract_txt:library in 833) [ClassicSimilarity], result of:
            0.027481565 = score(doc=833,freq=1.0), product of:
              0.078846 = queryWeight, product of:
                1.2810447 = boost
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.019313974 = queryNorm
              0.34854737 = fieldWeight in 833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.109375 = fieldNorm(doc=833)
          0.19664891 = weight(abstract_txt:dewey in 833) [ClassicSimilarity], result of:
            0.19664891 = score(doc=833,freq=3.0), product of:
              0.1773409 = queryWeight, product of:
                1.5686761 = boost
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.019313974 = queryNorm
              1.1088752 = fieldWeight in 833, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.109375 = fieldNorm(doc=833)
          0.12634455 = weight(abstract_txt:decimal in 833) [ClassicSimilarity], result of:
            0.12634455 = score(doc=833,freq=1.0), product of:
              0.19044048 = queryWeight, product of:
                1.6255804 = boost
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.019313974 = queryNorm
              0.66343325 = fieldWeight in 833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.109375 = fieldNorm(doc=833)
          0.22492468 = weight(abstract_txt:hierarchy in 833) [ClassicSimilarity], result of:
            0.22492468 = score(doc=833,freq=2.0), product of:
              0.22202559 = queryWeight, product of:
                1.755215 = boost
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.019313974 = queryNorm
              1.0130575 = fieldWeight in 833, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.5493927 = idf(docFreq=171, maxDocs=44218)
                0.109375 = fieldNorm(doc=833)
          0.14407052 = weight(abstract_txt:classification in 833) [ClassicSimilarity], result of:
            0.14407052 = score(doc=833,freq=1.0), product of:
              0.32995775 = queryWeight, product of:
                4.2794504 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.019313974 = queryNorm
              0.43663323 = fieldWeight in 833, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.109375 = fieldNorm(doc=833)
        0.2 = coord(5/25)
    
  5. Jouguelet, S.: Various applications of the Dewey Decimal Classification at the Bibliothèque Nationale de France (1998) 0.14
    0.14280075 = sum of:
      0.14280075 = product of:
        0.71400374 = sum of:
          0.05913981 = weight(abstract_txt:within in 767) [ClassicSimilarity], result of:
            0.05913981 = score(doc=767,freq=2.0), product of:
              0.091123745 = queryWeight, product of:
                1.1244615 = boost
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.019313974 = queryNorm
              0.6490055 = fieldWeight in 767, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.195805 = idf(docFreq=1809, maxDocs=44218)
                0.109375 = fieldNorm(doc=767)
          0.027481565 = weight(abstract_txt:library in 767) [ClassicSimilarity], result of:
            0.027481565 = score(doc=767,freq=1.0), product of:
              0.078846 = queryWeight, product of:
                1.2810447 = boost
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.019313974 = queryNorm
              0.34854737 = fieldWeight in 767, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1867187 = idf(docFreq=4964, maxDocs=44218)
                0.109375 = fieldNorm(doc=767)
          0.16056316 = weight(abstract_txt:dewey in 767) [ClassicSimilarity], result of:
            0.16056316 = score(doc=767,freq=2.0), product of:
              0.1773409 = queryWeight, product of:
                1.5686761 = boost
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.019313974 = queryNorm
              0.90539277 = fieldWeight in 767, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.853343 = idf(docFreq=344, maxDocs=44218)
                0.109375 = fieldNorm(doc=767)
          0.17867817 = weight(abstract_txt:decimal in 767) [ClassicSimilarity], result of:
            0.17867817 = score(doc=767,freq=2.0), product of:
              0.19044048 = queryWeight, product of:
                1.6255804 = boost
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.019313974 = queryNorm
              0.9382363 = fieldWeight in 767, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0656753 = idf(docFreq=278, maxDocs=44218)
                0.109375 = fieldNorm(doc=767)
          0.28814104 = weight(abstract_txt:classification in 767) [ClassicSimilarity], result of:
            0.28814104 = score(doc=767,freq=4.0), product of:
              0.32995775 = queryWeight, product of:
                4.2794504 = boost
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.019313974 = queryNorm
              0.87326646 = fieldWeight in 767, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9920752 = idf(docFreq=2218, maxDocs=44218)
                0.109375 = fieldNorm(doc=767)
        0.2 = coord(5/25)
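
The content scores combine more factors than the author scores. Each matching term contributes queryWeight (boost × idf × queryNorm) times fieldWeight (tf × idf × fieldNorm). These contributions are summed, then scaled by coord, the fraction of the 25 query terms that matched. The sketch below recomputes the first entry (doc 1808) under those assumptions. The function and constant names are illustrative, the boosts and queryNorm are copied from the listing rather than derived, and the idf formula is the one the listed values are consistent with.

```python
import math

QUERY_NORM = 0.019313974  # copied from the listing, not derived here
MAX_DOCS = 44218


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency, matching the idf values in the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float, field_norm: float,
               query_norm: float = QUERY_NORM) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm: float, total_query_terms: int) -> float:
    """Sum the matching-term scores, then scale by coord = matched / total."""
    matched_sum = sum(term_score(f, df, b, field_norm) for f, df, b in terms)
    coord = len(terms) / total_query_terms
    return matched_sum * coord


# (freq, docFreq, boost) for the four terms matched in doc 1808.
DOC_1808_TERMS = [
    (1.0, 4964, 1.2810447),  # library
    (1.0, 344, 1.5686761),   # dewey
    (2.0, 278, 1.6255804),   # decimal
    (3.0, 2218, 4.2794504),  # classification
]

if __name__ == "__main__":
    # The listing reports 0.15613233 for this document.
    print(document_score(DOC_1808_TERMS, field_norm=0.1875, total_query_terms=25))
```

The coord factor explains why the content scores are small: even the top entry matches only 4 of 25 query terms, so its summed term score of about 0.976 is multiplied by 0.16.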