Search (70 results, page 1 of 4)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.42

0.42283043 = product of:
  0.63424563 = sum of:
    0.058476865 = product of:
      0.1754306 = sum of:
        0.1754306 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.1754306 = score(doc=562,freq=2.0), product of:
            0.31214407 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.036818076 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.1754306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.1754306 = score(doc=562,freq=2.0), product of:
        0.31214407 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.036818076 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.034511987 = weight(_text_:data in 562) [ClassicSimilarity], result of:
      0.034511987 = score(doc=562,freq=4.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.29644224 = fieldWeight in 562, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.1754306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.1754306 = score(doc=562,freq=2.0), product of:
        0.31214407 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.036818076 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.1754306 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.1754306 = score(doc=562,freq=2.0), product of:
        0.31214407 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.036818076 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.014965023 = product of:
      0.029930046 = sum of:
        0.029930046 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.029930046 = score(doc=562,freq=2.0), product of:
            0.12893063 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036818076 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.6666667 = coord(6/9)
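
The tree above combines its clauses in two steps: each "sum of" node adds the scores of its matching clauses, and the enclosing "product of" node scales that sum by a coord factor (matched clauses / total clauses). A minimal Python sketch of that arithmetic follows. The helper name `combine` is hypothetical, the leaf scores are copied from the listing for doc 562, and double-precision floats are assumed where the search engine works in single precision, so the last digits can differ slightly.

```python
def combine(clause_scores, matched, total):
    """Sum the clause scores and scale by coord(matched/total)."""
    return sum(clause_scores) * matched / total

# Leaf scores copied from the explain tree for doc 562.
SCORE_3A = 0.1754306
SCORE_2F = 0.1754306
SCORE_DATA = 0.034511987
SCORE_22 = 0.029930046

# Nested sub-queries carry their own coord factors.
nested_3a = combine([SCORE_3A], 1, 3)   # coord(1/3)
nested_22 = combine([SCORE_22], 1, 2)   # coord(1/2)

# Top level: six matching clauses out of nine.
doc_562 = combine(
    [nested_3a, SCORE_2F, SCORE_DATA, SCORE_2F, SCORE_2F, nested_22],
    6, 9,
)
print(round(doc_562, 6))
```

The result agrees with the listed 0.42283043 to about six decimal places.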

Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32
Source: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK

Wang, J.: An extensive study on automated Dewey Decimal Classification (2009) 0.02
```
0.020091478 = product of:
  0.09041165 = sum of:
    0.06165166 = weight(_text_:bibliographic in 3172) [ClassicSimilarity], result of:
      0.06165166 = score(doc=3172,freq=8.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.43012467 = fieldWeight in 3172, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3172)
    0.028759988 = weight(_text_:data in 3172) [ClassicSimilarity], result of:
      0.028759988 = score(doc=3172,freq=4.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.24703519 = fieldWeight in 3172, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3172)
  0.22222222 = coord(2/9)
```
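
Each leaf in these trees can be recomputed from the statistics it lists. The sketch below is a reading of the numbers shown, not the engine's source: it assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the printed tf and idf values. queryNorm and fieldNorm are taken from the listing as given inputs, and the function names are hypothetical.

```python
import math

def idf(doc_freq, max_docs):
    """Inverse document frequency, as inferred from the listed values."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """score = queryWeight * fieldWeight for a single term."""
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight

# Values copied from the explain tree for doc 3172.
QUERY_NORM = 0.036818076
MAX_DOCS = 44218
FIELD_NORM = 0.0390625

bibliographic = term_score(8.0, 2449, MAX_DOCS, QUERY_NORM, FIELD_NORM)
data = term_score(4.0, 5088, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# Two of nine query clauses matched, hence coord(2/9).
doc_3172 = (bibliographic + data) * 2 / 9
print(round(bibliographic, 6), round(data, 6), round(doc_3172, 6))
```

Because idf enters both queryWeight and fieldWeight, a rare term such as "bibliographic" contributes with its idf squared, which is why it outweighs "data" here despite the same fieldNorm.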
Abstract

In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first analyze statistically the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
Ahmed, M.; Mukhopadhyay, M.; Mukhopadhyay, P.: Automated knowledge organization : AI ML based subject indexing system for libraries (2023) 0.02
```
0.018255975 = product of:
  0.08215189 = sum of:
    0.053391904 = weight(_text_:bibliographic in 977) [ClassicSimilarity], result of:
      0.053391904 = score(doc=977,freq=6.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.3724989 = fieldWeight in 977, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0390625 = fieldNorm(doc=977)
    0.028759988 = weight(_text_:data in 977) [ClassicSimilarity], result of:
      0.028759988 = score(doc=977,freq=4.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.24703519 = fieldWeight in 977, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0390625 = fieldNorm(doc=977)
  0.22222222 = coord(2/9)
```
Abstract

The research study as reported here is an attempt to explore the possibilities of an AI/ML-based semi-automated indexing system in a library setup to handle large volumes of documents. It uses the Python virtual environment to install and configure an open source AI environment (named Annif) to feed the LOD (Linked Open Data) dataset of Library of Congress Subject Headings (LCSH) as a standard KOS (Knowledge Organisation System). The framework deployed the Turtle format of LCSH after cleaning the file with Skosify, applied an array of backend algorithms (namely TF-IDF, Omikuji, and NN-Ensemble) to measure relative performance, and selected Snowball as an analyser. The training of Annif was conducted with a large set of bibliographic records populated with subject descriptors (MARC tag 650$a) and indexed by trained LIS professionals. The training dataset is first treated with MarcEdit to export it in a format suitable for OpenRefine, and then in OpenRefine it undergoes many steps to produce a bibliographic record set suitable to train Annif. The framework, after training, has been tested with a bibliographic dataset to measure indexing efficiencies, and finally, the automated indexing framework is integrated with data wrangling software (OpenRefine) to produce suggested headings on a mass scale. The entire framework is based on open-source software, open datasets, and open standards.

Reiner, U.: DDC-based search in the data of the German National Bibliography (2008) 0.02

0.015889551 = product of:
  0.07150298 = sum of:
    0.036990993 = weight(_text_:bibliographic in 2166) [ClassicSimilarity], result of:
      0.036990993 = score(doc=2166,freq=2.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.2580748 = fieldWeight in 2166, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.046875 = fieldNorm(doc=2166)
    0.034511987 = weight(_text_:data in 2166) [ClassicSimilarity], result of:
      0.034511987 = score(doc=2166,freq=4.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.29644224 = fieldWeight in 2166, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.046875 = fieldNorm(doc=2166)
  0.22222222 = coord(2/9)

Abstract: In 2004, the German National Library began to classify title records of the German National Bibliography according to subject groups based on the divisions of the Dewey Decimal Classification (DDC). Since 2006, all titles of the main series of the German National Bibliography are classified in strict compliance with the DDC. On this basis, an enhanced DDC-based search can be realized - e.g., searching the data of the German National Bibliography for title records using number components of synthesized classification numbers or searching for DDC numbers using unclassified title records. This paper gives an account of the current research and development of the DDC-based search. The work is conducted in the VZG project Colibri that focuses on the automatic analysis of DDC-synthesized numbers and the automatic classification of bibliographic title records.

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.01

0.014580995 = product of:
  0.06561448 = sum of:
    0.040672768 = weight(_text_:data in 2748) [ClassicSimilarity], result of:
      0.040672768 = score(doc=2748,freq=2.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.34936053 = fieldWeight in 2748, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.078125 = fieldNorm(doc=2748)
    0.024941705 = product of:
      0.04988341 = sum of:
        0.04988341 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.04988341 = score(doc=2748,freq=2.0), product of:
            0.12893063 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036818076 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.22222222 = coord(2/9)

Date: 1. 2.2016 18:25:22
Source: Semantic keyword-based search on structured data sources: First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers. Eds.: J. Cardoso et al

Golub, K.; Hansson, J.; Soergel, D.; Tudhope, D.: Managing classification in libraries : a methodological outline for evaluating automatic subject indexing and classification in Swedish library catalogues (2015) 0.01
```
0.014206818 = product of:
  0.06393068 = sum of:
    0.0435943 = weight(_text_:bibliographic in 2300) [ClassicSimilarity], result of:
      0.0435943 = score(doc=2300,freq=4.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.30414405 = fieldWeight in 2300, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
    0.020336384 = weight(_text_:data in 2300) [ClassicSimilarity], result of:
      0.020336384 = score(doc=2300,freq=2.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.17468026 = fieldWeight in 2300, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
  0.22222222 = coord(2/9)
```
Abstract

Subject terms play a crucial role in resource discovery but require substantial effort to produce. Automatic subject classification and indexing address problems of scale and sustainability and can be used to enrich existing bibliographic records, establish more connections across and between resources and enhance consistency of bibliographic data. The paper aims to put forward a complex methodological framework to evaluate automatic classification tools of Swedish textual documents based on the Dewey Decimal Classification (DDC) recently introduced to Swedish libraries. Three major complementary approaches are suggested: a quality-built gold standard, retrieval effects, domain analysis. The gold standard is built based on input from at least two catalogue librarians, end-users expert in the subject, end users inexperienced in the subject and automated tools. Retrieval effects are studied through a combination of assigned and free tasks, including factual and comprehensive types. The study also takes into consideration the different role and character of subject terms in various knowledge domains, such as scientific disciplines. As a theoretical framework, domain analysis is used and applied in relation to the implementation of DDC in Swedish libraries and chosen domains of knowledge within the DDC itself.

Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.01

0.013470079 = product of:
  0.060615353 = sum of:
    0.04315616 = weight(_text_:bibliographic in 2560) [ClassicSimilarity], result of:
      0.04315616 = score(doc=2560,freq=2.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.30108726 = fieldWeight in 2560, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2560)
    0.017459193 = product of:
      0.034918386 = sum of:
        0.034918386 = weight(_text_:22 in 2560) [ClassicSimilarity], result of:
          0.034918386 = score(doc=2560,freq=2.0), product of:
            0.12893063 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036818076 = queryNorm
            0.2708308 = fieldWeight in 2560, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2560)
      0.5 = coord(1/2)
  0.22222222 = coord(2/9)

Date: 22. 9.2008 18:31:54
Source: International cataloguing and bibliographic control. 36(2007) no.4, pp. 78-82

Dubin, D.: Dimensions and discriminability (1998) 0.01

0.010206696 = product of:
  0.04593013 = sum of:
    0.028470935 = weight(_text_:data in 2338) [ClassicSimilarity], result of:
      0.028470935 = score(doc=2338,freq=2.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.24455236 = fieldWeight in 2338, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0546875 = fieldNorm(doc=2338)
    0.017459193 = product of:
      0.034918386 = sum of:
        0.034918386 = weight(_text_:22 in 2338) [ClassicSimilarity], result of:
          0.034918386 = score(doc=2338,freq=2.0), product of:
            0.12893063 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036818076 = queryNorm
            0.2708308 = fieldWeight in 2338, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2338)
      0.5 = coord(1/2)
  0.22222222 = coord(2/9)

Date: 22. 9.1997 19:16:05
Source: Visualizing subject access for 21st century information resources: Papers presented at the 1997 Clinic on Library Applications of Data Processing, 2-4 Mar 1997, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Ed.: P.A. Cochrane et al

Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.01

0.010206696 = product of:
  0.04593013 = sum of:
    0.028470935 = weight(_text_:data in 5273) [ClassicSimilarity], result of:
      0.028470935 = score(doc=5273,freq=2.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.24455236 = fieldWeight in 5273, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0546875 = fieldNorm(doc=5273)
    0.017459193 = product of:
      0.034918386 = sum of:
        0.034918386 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
          0.034918386 = score(doc=5273,freq=2.0), product of:
            0.12893063 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036818076 = queryNorm
            0.2708308 = fieldWeight in 5273, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5273)
      0.5 = coord(1/2)
  0.22222222 = coord(2/9)

Abstract: In text categorization tasks, classification over a class hierarchy often gives better results than classification without one. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
Date: 22. 7.2006 16:24:52

Schiminovich, S.: Automatic classification and retrieval of documents by means of a bibliographic pattern discovery algorithm (1971) 0.01

0.009590258 = product of:
  0.08631232 = sum of:
    0.08631232 = weight(_text_:bibliographic in 4846) [ClassicSimilarity], result of:
      0.08631232 = score(doc=4846,freq=2.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.6021745 = fieldWeight in 4846, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.109375 = fieldNorm(doc=4846)
  0.11111111 = coord(1/9)

Wu, M.; Liu, Y.-H.; Brownlee, R.; Zhang, X.: Evaluating utility and automatic classification of subject metadata from Research Data Australia (2021) 0.01
```
0.0076693306 = product of:
  0.069023974 = sum of:
    0.069023974 = weight(_text_:data in 453) [ClassicSimilarity], result of:
      0.069023974 = score(doc=453,freq=16.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.5928845 = fieldWeight in 453, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.046875 = fieldNorm(doc=453)
  0.11111111 = coord(1/9)
```
Abstract

In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well represented categories of subject metadata, and these categories tend to have features that can distinguish themselves from the other categories. Our findings raise implications for data catalogue providers; they should invest more effort to enhance the quality of data records by providing an adequate description of these records for under-represented subject categories.
Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.01
```
0.0072904974 = product of:
  0.03280724 = sum of:
    0.020336384 = weight(_text_:data in 2765) [ClassicSimilarity], result of:
      0.020336384 = score(doc=2765,freq=2.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.17468026 = fieldWeight in 2765, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2765)
    0.012470853 = product of:
      0.024941705 = sum of:
        0.024941705 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
          0.024941705 = score(doc=2765,freq=2.0), product of:
            0.12893063 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.036818076 = queryNorm
            0.19345059 = fieldWeight in 2765, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
  0.22222222 = coord(2/9)
```
Abstract

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.

Date

22. 3.2009 19:14:43

Fong, A.C.M.: Mining a Web citation database for document clustering (2002) 0.01

0.0063268747 = product of:
  0.05694187 = sum of:
    0.05694187 = weight(_text_:data in 3940) [ClassicSimilarity], result of:
      0.05694187 = score(doc=3940,freq=2.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.48910472 = fieldWeight in 3940, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.109375 = fieldNorm(doc=3940)
  0.11111111 = coord(1/9)

Theme: Data Mining

Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.01
```
0.0063268747 = product of:
  0.05694187 = sum of:
    0.05694187 = weight(_text_:data in 724) [ClassicSimilarity], result of:
      0.05694187 = score(doc=724,freq=8.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.48910472 = fieldWeight in 724, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0546875 = fieldNorm(doc=724)
  0.11111111 = coord(1/9)
```
Abstract

The Wikidata gadget, CCLitBox, for the automated classification of literary authors and works by a faceted classification and using Linked Open Data (LOD) is presented. The tool reproduces the classification algorithm of class O Literature of the Colon Classification and uses data freely available in Wikidata to create Colon Classification class numbers. CCLitBox is totally free and enables any user to classify literary authors and their works; it is easily accessible to everybody; it uses LOD from Wikidata but missing data for classification can be freely added if necessary; it is readymade for any cooperative and networked project.
Classification, automation, and new media : Proceedings of the 24th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Passau, March 15 - 17, 2000 (2002) 0.01
```
0.005978335 = product of:
  0.053805016 = sum of:
    0.053805016 = weight(_text_:data in 5997) [ClassicSimilarity], result of:
      0.053805016 = score(doc=5997,freq=14.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.46216056 = fieldWeight in 5997, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5997)
  0.11111111 = coord(1/9)
```
Abstract

Given the huge amount of information in the internet and in practically every domain of knowledge that we are facing today, knowledge discovery calls for automation. The book deals with methods from classification and data analysis that respond effectively to this rapidly growing challenge. The interested reader will find new methodological insights as well as applications in economics, management science, finance, and marketing, and in pattern recognition, biology, health, and archaeology.

Content

Data Analysis, Statistics, and Classification.- Pattern Recognition and Automation.- Data Mining, Information Processing, and Automation.- New Media, Web Mining, and Automation.- Applications in Management Science, Finance, and Marketing.- Applications in Medicine, Biology, Archaeology, and Others.- Author Index.- Subject Index.

RSWK

Data Mining / Kongress / Passau <2000>

Series

Proceedings of the ... annual conference of the Gesellschaft für Klassifikation e.V. ; 24 (Studies in classification, data analysis, and knowledge organization)

Subject

Data Mining / Kongress / Passau <2000>

Theme

Data Mining
Guerrero-Bote, V.P.; Moya Anegón, F. de; Herrero Solana, V.: Document organization using Kohonen's algorithm (2002) 0.01
```
0.005480147 = product of:
  0.049321324 = sum of:
    0.049321324 = weight(_text_:bibliographic in 2564) [ClassicSimilarity], result of:
      0.049321324 = score(doc=2564,freq=2.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.34409973 = fieldWeight in 2564, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0625 = fieldNorm(doc=2564)
  0.11111111 = coord(1/9)
```
Abstract

The classification of documents from a bibliographic database is a task that is linked to processes of information retrieval based on partial matching. A method is described of vectorizing reference documents from LISA which permits their topological organization using Kohonen's algorithm. As an example a map is generated of 202 documents from LISA, and an analysis is made of the possibilities of this type of neural network with respect to the development of information retrieval systems based on graphical browsing.

Autonomy, Inc.: Automatic classification (n.d.) 0.01

```
0.005112887 = product of:
  0.04601598 = sum of:
    0.04601598 = weight(_text_:data in 1666) [ClassicSimilarity], result of:
      0.04601598 = score(doc=1666,freq=4.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.3952563 = fieldWeight in 1666, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0625 = fieldNorm(doc=1666)
  0.11111111 = coord(1/9)
```

Abstract

Autonomy's Classification solutions remove the necessity for organizations to rely on human intervention or manual processing of information, such as manual tagging, typically required to make most other e-business applications work. Autonomy's ability to consistently and accurately classify data automatically is a unique infrastructure solution that overcomes the predicaments surrounding the exponential growth of unstructured data.

Billal, B.; Fonseca, A.; Sadat, F.; Lounis, H.: Semi-supervised learning and social media text analysis towards multi-labeling categorization (2017) 0.01
```
0.005112887 = product of:
  0.04601598 = sum of:
    0.04601598 = weight(_text_:data in 4095) [ClassicSimilarity], result of:
      0.04601598 = score(doc=4095,freq=16.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.3952563 = fieldWeight in 4095, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.03125 = fieldNorm(doc=4095)
  0.11111111 = coord(1/9)
```
Abstract

In traditional text classification, classes are mutually exclusive, i.e. it is not possible to have one text or text fragment classified into more than one class. On the other hand, in multi-label classification an individual text may belong to several classes simultaneously. This type of classification is required by a large number of current applications such as big data classification and image and video annotation. Supervised learning is the most widely used type of machine learning in the classification task. It requires large quantities of labeled data and the intervention of a human tagger in the creation of the training sets. When the data sets become very large or heavily noisy, this operation can be tedious, prone to error and time consuming. In this case, semi-supervised learning, which requires only a few labels, is a better choice. In this paper, we study and evaluate several methods to address the problem of multi-label classification using semi-supervised learning and data from social networks. First, we propose a linguistic pre-processing step involving tokenisation, recognition of named entities and hashtag segmentation in order to decrease the noise in this type of massive and unstructured real data, and then we perform word sense disambiguation using WordNet. Second, several experiments related to multi-label classification and semi-supervised learning are carried out on these data sets and compared to each other. These evaluations compare the results of the approaches considered. This paper proposes a method for combining semi-supervised methods with a graph method for the extraction of subjects in social networks using a multi-label classification approach. Experiments show that the proposed model improves classification precision by 4 percentage points over a baseline.

Source

IEEE International Conference on Big Data (Big Data) (2017)
Salles, T.; Rocha, L.; Gonçalves, M.A.; Almeida, J.M.; Mourão, F.; Meira Jr., W.; Viegas, F.: A quantitative analysis of the temporal effects on automatic text classification (2016) 0.01
```
0.0050526154 = product of:
  0.04547354 = sum of:
    0.04547354 = weight(_text_:data in 3014) [ClassicSimilarity], result of:
      0.04547354 = score(doc=3014,freq=10.0), product of:
        0.11642061 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.036818076 = queryNorm
        0.39059696 = fieldWeight in 3014, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3014)
  0.11111111 = coord(1/9)
```
Abstract

Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.
Wille, J.: Automatisches Klassifizieren bibliographischer Beschreibungsdaten : Vorgehensweise und Ergebnisse (2006) 0.00
```
0.004795129 = product of:
  0.04315616 = sum of:
    0.04315616 = weight(_text_:bibliographic in 6090) [ClassicSimilarity], result of:
      0.04315616 = score(doc=6090,freq=2.0), product of:
        0.14333439 = queryWeight, product of:
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.036818076 = queryNorm
        0.30108726 = fieldWeight in 6090, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.893044 = idf(docFreq=2449, maxDocs=44218)
          0.0546875 = fieldNorm(doc=6090)
  0.11111111 = coord(1/9)
```
Abstract

This work deals with the practical aspects of automatically classifying bibliographic reference data. The focus is on the concrete procedure, demonstrated with COBRA ("Classification Of Bibliographic Records, Automatic"), an open-source program developed specifically for this purpose. The general conditions and parameters for its use in a library setting are clarified. Finally, classification results are evaluated using social science data from the SOLIS database as an example.
