Search (172 results, page 1 of 9)

  • Filter: theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.18
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
    Content
    Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
    Source
    Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK
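    Code sketch
    A minimal sketch (not the authors' implementation) of the general idea above: Bag-of-Words features are extended with concept features drawn from background knowledge, and a boosted ensemble of weak learners performs the classification. It assumes scikit-learn and SciPy; the documents, labels, and concept mapping are invented placeholders.

      # Hedged illustration of BoW + concept features with boosting (not the paper's code).
      from scipy.sparse import hstack, csr_matrix
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.ensemble import AdaBoostClassifier

      docs = ["stocks fell sharply on wall street",
              "the striker scored twice in the final",
              "markets rallied after the earnings report",
              "the goalkeeper saved a late penalty"]
      labels = [0, 1, 0, 1]  # toy labels: 0 = finance, 1 = sport

      # Toy "background knowledge": surface terms mapped to higher-level concepts.
      concept_map = {"stocks": "FINANCE", "markets": "FINANCE", "earnings": "FINANCE",
                     "striker": "SPORT", "goalkeeper": "SPORT", "penalty": "SPORT"}
      concepts = sorted(set(concept_map.values()))

      bow = CountVectorizer()
      X_words = bow.fit_transform(docs)

      # One extra column per concept, counting occurrences of any of its terms.
      rows = [[sum(doc.split().count(term)
                   for term, concept_of_term in concept_map.items()
                   if concept_of_term == concept)
               for concept in concepts]
              for doc in docs]
      X = hstack([X_words, csr_matrix(rows)])          # word features + concept features

      clf = AdaBoostClassifier(n_estimators=50, random_state=0)  # boosting weak learners
      clf.fit(X, labels)
      print(clf.predict(X))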
  2. Wang, J.: ¬An extensive study on automated Dewey Decimal Classification (2009) 0.07
    Abstract
    In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.11, S.2269-2286
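    Code sketch
    A hedged sketch of one way to read the "balanced virtual tree" idea above (this is not the authors' algorithm): sparsely populated child classes are folded into their parent until every retained node has enough training documents, which both flattens the hierarchy and evens out the category distribution. The toy hierarchy, document counts, and threshold are assumptions.

      # Toy DDC-like fragment: class -> child classes, and training documents per class.
      hierarchy = {"5": ["51", "53"], "51": ["511", "512"], "53": []}
      doc_counts = {"5": 5, "51": 40, "511": 3, "512": 60, "53": 2}
      MIN_DOCS = 10  # minimum support for a node to survive in the virtual tree

      def build_virtual_tree(node):
          """Return (children kept under this node, documents absorbed into it)."""
          absorbed = doc_counts[node]
          kept = []
          for child in hierarchy.get(node, []):
              child_kept, child_docs = build_virtual_tree(child)
              if child_docs >= MIN_DOCS:
                  kept.append((child, child_kept))
              else:                          # too sparse: fold the child into this node
                  absorbed += child_docs
          return kept, absorbed

      print(build_virtual_tree("5"))   # ([('51', [('512', [])])], 7) for the toy data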
  3. Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.06
    Abstract
    This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
    Date
    4. 8.2015 19:22:04
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.9, S.1817-1831
  4. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.05
    Abstract
    We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between the latent semantic indexing (LSI) term subspace and the LSI document subspace. LSISSM performs feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with the random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
    Date
    23. 3.2013 13:22:36
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.4, S.844-860
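    Code sketch
    A hedged sketch, not the LSISSM implementation: documents are projected into an LSI subspace with a truncated SVD, and the document contributing most strongly to each latent dimension is used to seed K-means, as a rough analogue of the signature-based initialization described above. scikit-learn, the corpus, and the number of clusters are assumptions.

      # LSI projection plus contribution-based seeding of K-means (illustrative only).
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.cluster import KMeans

      corpus = ["library classification schemes", "dewey decimal classes",
                "neural text categorization", "support vector text classifiers"]
      k = 2

      X = TfidfVectorizer().fit_transform(corpus)
      lsi = TruncatedSVD(n_components=k, random_state=0)
      X_lsi = lsi.fit_transform(X)                 # documents in the LSI subspace

      # Seed each cluster with the document contributing most to one latent dimension.
      seed_idx = np.argmax(np.abs(X_lsi), axis=0)
      seeds = X_lsi[seed_idx]

      km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit(X_lsi)
      print(km.labels_)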
  5. Automatic classification research at OCLC (2002) 0.05
    Abstract
    OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged participation in international standards communities whose outcomes have been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
    Date
    5. 5.2003 9:22:09
  6. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.05
    Abstract
    Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, the passages are classified using text-mining techniques. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP statistically significantly outperforms the other document-splitting approaches (99% confidence) by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.
    Date
    22. 3.2009 19:14:43
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.4, S.814-825
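    Code sketch
    A hedged sketch (not the authors' KDP method): a document is split into fixed-length passages and every passage is classified, which illustrates the passage-detection setting described above, where passages rather than whole documents receive category labels. scikit-learn, the training data, and the window size are invented.

      # Split a document into passages and classify each one (illustrative only).
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      train_passages = ["quarterly earnings beat analyst expectations",
                        "the defender was booked for a late tackle"]
      train_labels = ["finance", "sport"]
      clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(train_passages, train_labels)

      document = ("the match ended two nil and the striker celebrated "
                  "meanwhile the company reported record quarterly earnings")
      window = 6
      tokens = document.split()
      passages = [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), window)]

      # Label every passage; a document is flagged if any passage hits a watched class.
      for passage, label in zip(passages, clf.predict(passages)):
          print(f"{label}: {passage}")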
  7. Frank, E.; Paynter, G.W.: Predicting Library of Congress Classifications from Library of Congress Subject Headings (2004) 0.04
    Abstract
    This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCCs are organized in a tree: The root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a model that maps from sets of LCSH to classifications from the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.
    Source
    Journal of the American Society for Information Science and Technology. 55(2004) no.3, S.214-227
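    Code sketch
    A hedged sketch (not the authors' system): each record is represented by its set of LCSH headings, encoded as multi-hot features, and a simple linear classifier learns the mapping to a top-level LCC class. The heading sets, classes, and classifier choice are assumptions.

      # Learn a mapping from sets of LCSH headings to LCC classes (illustrative only).
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      lcsh_sets = ["Mathematics--Study and teaching; Algebra",
                   "World War, 1939-1945; Europe--History",
                   "Algebra; Number theory",
                   "Europe--History--1492-"]
      lcc_classes = ["QA", "D", "QA", "D"]

      # Treat each full heading as one token so a heading set becomes a multi-hot vector.
      vec = CountVectorizer(tokenizer=lambda s: [h.strip() for h in s.split(";")],
                            token_pattern=None, lowercase=False)
      model = make_pipeline(vec, LogisticRegression(max_iter=1000))
      model.fit(lcsh_sets, lcc_classes)

      print(model.predict(["Number theory; Mathematics--Study and teaching"]))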
  8. Godby, C. J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization (2001) 0.04
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
  9. Godby, C.J.; Stuler, J.: ¬The Library of Congress Classification as a knowledge base for automatic subject categorization : subject access issues (2003) 0.04
    Abstract
    This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC's WorldCat database and filtered using the log-likelihood statistic.
    Source
    Subject retrieval in a networked environment: Proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. Ed.: I.C. McIlwaine
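    Code sketch
    A hedged sketch of the log-likelihood (G^2) association statistic referred to in this and the preceding abstract, in the usual entropy-based formulation for a 2x2 contingency table; the counts below are invented, not data from the study.

      # Dunning-style log-likelihood ratio for a 2x2 table (term in class vs. elsewhere).
      import math

      def log_likelihood(k11, k12, k21, k22):
          """G^2 statistic for the 2x2 table [[k11, k12], [k21, k22]]."""
          def entropy(*counts):
              total = sum(counts)
              return sum(c * math.log(c / total) for c in counts if c > 0)
          return 2 * (entropy(k11, k12, k21, k22)
                      - entropy(k11 + k12, k21 + k22)
                      - entropy(k11 + k21, k12 + k22))

      # Toy counts: a heading occurs 30 times in the target class and 5 times elsewhere,
      # against 1000 and 9000 other occurrences respectively.
      print(round(log_likelihood(30, 1000, 5, 9000), 2))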
  10. Salles, T.; Rocha, L.; Gonçalves, M.A.; Almeida, J.M.; Mourão, F.; Meira Jr., W.; Viegas, F.: ¬A quantitative analysis of the temporal effects on automatic text classification (2016) 0.03
    Abstract
    Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.7, S.1639-1667
  11. Larson, R.R.: Experiments in automatic Library of Congress Classification (1992) 0.03
    Abstract
    This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial match retrieval techniques using various elements of new records (i.e., those to be classified) as "queries", and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 86% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.
    Source
    Journal of the American Society for Information Science. 43(1992), S.130-148
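    Code sketch
    A hedged sketch (not Larson's exact setup): a new record's title and subject terms are matched against clusters pooled from previously classified MARC records, and the class of the best-matching cluster is assigned. The clusters, query record, and tf-idf/cosine choice are illustrative assumptions.

      # Partial-match assignment of a classification via tf-idf cosine similarity.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Each "cluster" pools titles/subject headings of records sharing a class number.
      clusters = {
          "QA76": "computer programming software algorithms data structures",
          "Z699": "information storage retrieval automatic indexing classification",
      }
      new_record = "automatic classification and indexing of MARC records"

      vec = TfidfVectorizer()
      matrix = vec.fit_transform(list(clusters.values()) + [new_record])
      sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

      best_class, best_sim = max(zip(clusters, sims), key=lambda pair: pair[1])
      print("predicted class:", best_class, "similarity:", round(float(best_sim), 3))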
  12. Yang, Y.; Liu, X.: ¬A re-examination of text categorization methods (1999) 0.03
    Abstract
    This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (less than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
  13. Borodin, Y.; Polishchuk, V.; Mahmud, J.; Ramakrishnan, I.V.; Stent, A.: Live and learn from mistakes : a lightweight system for document classification (2013) 0.03
    Abstract
    We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
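    Code sketch
    A loosely hedged reading of the clusterhead idea described above (not the authors' 3LM code): on a misclassification, the responsible class representative is pulled toward the misclassified document, while a "leash" keeps it within a fixed fraction of the way from its class centroid. The vectors, learning rate, and leash factor are invented.

      # One possible clusterhead update with negative feedback and a centroid leash.
      import numpy as np

      def update_clusterhead(clusterhead, centroid, misclassified_doc,
                             learning_rate=0.2, leash=0.5):
          """Move toward the error, then pull back toward the class centroid."""
          moved = clusterhead + learning_rate * (misclassified_doc - clusterhead)
          return centroid + leash * (moved - centroid)

      centroid = np.array([1.0, 0.0])
      clusterhead = centroid.copy()
      doc = np.array([0.2, 0.9])      # a document this class keeps misclassifying

      for _ in range(3):
          clusterhead = update_clusterhead(clusterhead, centroid, doc)
      print(clusterhead)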
  14. Schaalje, G.B.; Blades, N.J.; Funai, T.: ¬An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.03
    Abstract
    Recent studies of authorship attribution have used machine-learning methods including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier to identify likely authors of disputed texts. These methods are all limited by an inability to perform open-set classification and account for text and corpus size. We propose a customized Bayesian logit-normal-beta-binomial classification model for supervised authorship attribution. The model is based on the beta-binomial distribution with an explicit inverse relationship between extra-binomial variation and text size. The model internally estimates the relationship of extra-binomial variation to text size, and uses Markov Chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine-learning methods as well as the open-set Bayesian classifier on undisputed papers of The Federalist, and testing the method on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. The Bayesian classifier was the best classifier of these texts.
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.9, S.1815-1825
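    Code sketch
    A hedged sketch (not the authors' fitted model, and without the explicit size adjustment they describe): each candidate author's use of a few marker words is modeled with a beta-binomial distribution, and the disputed text is scored by the summed log-likelihood. SciPy's betabinom is used; all counts and parameters are invented.

      # Score candidate authors with beta-binomial log-likelihoods (illustrative only).
      from scipy.stats import betabinom

      def author_log_score(word_counts, text_length, author_params):
          """Sum of beta-binomial log-likelihoods over the tracked marker words."""
          return sum(betabinom.logpmf(k, text_length, a, b)
                     for k, (a, b) in zip(word_counts, author_params))

      disputed_counts = [12, 3, 7]     # occurrences of three marker words
      n = 2000                         # length of the disputed text, in words

      hamilton = [(2.0, 300.0), (1.0, 800.0), (1.5, 400.0)]
      madison = [(1.0, 400.0), (2.0, 500.0), (3.0, 600.0)]

      for name, params in [("Hamilton", hamilton), ("Madison", madison)]:
          print(name, round(author_log_score(disputed_counts, n, params), 2))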
  15. Subramanian, S.; Shafer, K.E.: Clustering (1998) 0.03
    Abstract
    This article presents our exploration of computer science clustering algorithms as they relate to the Scorpion system. Scorpion is a research project at OCLC that explores the indexing and cataloging of electronic resources. For a more complete description of the Scorpion, please visit the Scorpion Web site at <http://purl.oclc.org/scorpion>
  16. Dolin, R.; Agrawal, D.; El Abbadi, A.; Pearlman, J.: Using automated classification for summarizing and selecting heterogeneous information sources (1998) 0.03
    Abstract
    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or unstructured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC) [10], within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR).
  17. Prabowo, R.; Jackson, M.; Burden, P.; Knoell, H.-D.: Ontology-based automatic classification for the Web pages : design, implementation and evaluation (2002) 0.02
    Abstract
    In recent years, we have witnessed the continual growth in the use of ontologies in order to provide a mechanism to enable machine reasoning. This paper describes an automatic classifier, which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion, and mapped into DDC and LCC. Secondly, we propose the formal definition of a DDC-LCC and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in improved classification in terms of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.
  18. Ahmed, M.; Mukhopadhyay, M.; Mukhopadhyay, P.: Automated knowledge organization : AI ML based subject indexing system for libraries (2023) 0.02
    Abstract
    The research study as reported here is an attempt to explore the possibilities of an AI/ML-based semi-automated indexing system in a library setup to handle large volumes of documents. It uses the Python virtual environment to install and configure an open source AI environment (named Annif) to feed the LOD (Linked Open Data) dataset of Library of Congress Subject Headings (LCSH) as a standard KOS (Knowledge Organisation System). The framework deployed the Turtle format of LCSH after cleaning the file with Skosify, applied an array of backend algorithms (namely TF-IDF, Omikuji, and NN-Ensemble) to measure relative performance, and selected Snowball as an analyser. The training of Annif was conducted with a large set of bibliographic records populated with subject descriptors (MARC tag 650$a) and indexed by trained LIS professionals. The training dataset is first treated with MarcEdit to export it in a format suitable for OpenRefine, and then in OpenRefine it undergoes many steps to produce a bibliographic record set suitable to train Annif. The framework, after training, has been tested with a bibliographic dataset to measure indexing efficiencies, and finally, the automated indexing framework is integrated with data wrangling software (OpenRefine) to produce suggested headings on a mass scale. The entire framework is based on open-source software, open datasets, and open standards.
    Source
    DESIDOC journal of library and information technology. 43(2023) no.1, S.45-54
  19. Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.02
    Abstract
    The Wikidata gadget, CCLitBox, for the automated classification of literary authors and works by a faceted classification and using Linked Open Data (LOD) is presented. The tool reproduces the classification algorithm of class O Literature of the Colon Classification and uses data freely available in Wikidata to create Colon Classification class numbers. CCLitBox is totally free and enables any user to classify literary authors and their works; it is easily accessible to everybody; it uses LOD from Wikidata but missing data for classification can be freely added if necessary; it is readymade for any cooperative and networked project.
    Source
    Cataloging and classification quarterly. 59(2021) no.8, p.835-852
  20. Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.02
    Date
    5. 5.2003 14:17:22
    Footnote
    Part of a special issue: OCLC and the Internet: An Historical Overview of Research Activities, 1990-1999 - Part II
    Source
    Journal of library administration. 34(2001) nos.3/4, S.221-228

Languages

  • e 162
  • d 8
  • a 1
  • chi 1

Types

  • a 146
  • el 25
  • m 3
  • x 3
  • s 2
  • r 1