Search (84 results, page 1 of 5)

  • Filter: theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.13
    Relevance score: 0.13194287 (Lucene ClassicSimilarity; matched terms "3a", "2f", "22" in doc 562; a sketch of the computation follows this entry)
    
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf
    Date
    8. 1.2013 10:22:32
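    The score breakdowns in this listing follow Lucene's ClassicSimilarity (TF-IDF) formula: each matching term contributes queryWeight * fieldWeight, where queryWeight = idf * queryNorm and fieldWeight = tf(freq) * idf * fieldNorm, and the per-term contributions are summed and scaled by coord factors. A minimal Python sketch of this arithmetic, reproducing the reported weight of the "2f" term in document 562 (the helper function is illustrative, not Lucene's API):

      import math

      def classic_similarity(freq, idf, query_norm, field_norm):
          """Recompute one term's contribution from a Lucene explain tree."""
          tf = math.sqrt(freq)                   # tf(freq=2.0) = 1.4142135
          query_weight = idf * query_norm        # queryWeight = idf * queryNorm
          field_weight = tf * idf * field_norm   # fieldWeight = tf * idf * fieldNorm
          return query_weight * field_weight     # score = queryWeight * fieldWeight

      # Values reported for the "2f" match in doc 562:
      score = classic_similarity(freq=2.0, idf=8.478011,
                                 query_norm=0.03903913, field_norm=0.046875)
      print(score)  # ~0.18601346, the reported term weight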
  2. HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.02
    Relevance score: 0.02043515 (matched terms "methods", "22")
    
    Date
    1. 2.2016 18:25:22
  3. Savic, D.: Automatic classification of office documents : review of available methods and techniques (1995) 0.02
    Relevance score: 0.01772975 (matched terms "29", "methods")
    
    Abstract
    Classification of office documents is one of the administrative functions carried out by almost every organization and institution which sends and receives correspondence. Processing this increasing amount of information, incoming and outgoing mail, and in particular its classification, is time-consuming and expensive. More and more organizations are seeking a solution to this challenge by designing computer-based systems for automatic classification. Examines the present status of available knowledge and methodology which can be used for automatic classification of office documents. Besides a review of classic methods and techniques, the focus is also placed on the application of artificial intelligence.
    Source
    Records management quarterly. 29(1995) no.4, S.3-18
  4. Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.02
    Relevance score: 0.017412834 (matched terms "29", "methods")
    
    Abstract
    We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods and cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
    Date
    1. 6.2010 9:29:57
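    The information-based weighting described above can be pictured as a convex combination of per-source similarity matrices prior to clustering. A hypothetical numpy/scikit-learn sketch (random stand-in matrices and weight, not the paper's kernels or weighting scheme):

      import numpy as np
      from sklearn.cluster import SpectralClustering

      rng = np.random.default_rng(1)
      n = 20                                # journals
      K_text = rng.random((n, n))
      K_text = (K_text + K_text.T) / 2      # symmetric text-mining similarity
      K_cite = rng.random((n, n))
      K_cite = (K_cite + K_cite.T) / 2      # symmetric bibliometric similarity

      w = 0.6                               # stand-in information-based weight
      K = w * K_text + (1 - w) * K_cite     # fused similarity matrix

      labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                                  random_state=0).fit_predict(K)
      print(labels)                         # cluster assignment per journal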
  5. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.02
    Relevance score: 0.017318232 (matched terms "theory", "methods")
    
    Abstract
    We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
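    A minimal sketch of the n-gram side of such feature extraction (scikit-learn; the two toy sentences and parameters are invented, not the scitex setup):

      from sklearn.feature_extraction.text import CountVectorizer

      docs = ["the gene is expressed", "the parser is trained on data"]
      vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
      X = vec.fit_transform(docs)                 # document-feature count matrix
      print(vec.get_feature_names_out())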
  6. Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.01
    Relevance score: 0.014510695 (matched terms "29", "methods")
    
    Abstract
    In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics (bibliometrics and scientometrics studies) for STW rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) are already attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has rendered necessary the integration of natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted digraphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), will seek to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.
    Source
    Knowledge organization. 29(2002) nos.3/4, S.181-197
  7. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.01
    Relevance score: 0.01436062 (matched terms "29", "methods")
    
    Abstract
    This paper presents a method that exploits the hierarchical structure of an indexing vocabulary to guide the development and training of machine learning methods for automatic text categorization. We present the design of a hierarchical classifier based on the divide-and-conquer principle. The method is evaluated using backpropagation neural networks as the machine learning algorithm, which learn to assign MeSH categories to a subset of MEDLINE records. Comparisons with the traditional Rocchio algorithm adapted for text categorization, as well as with flat neural network classifiers, are provided. The results indicate that the use of hierarchical structures improves performance significantly.
    Date
    11. 5.2003 18:29:44
  8. Yi, K.: Automatic text classification using library classification schemes : trends, issues and challenges (2007) 0.01
    Relevance score: 0.014304606 (matched terms "methods", "22")
    
    Abstract
    The proliferation of digital resources and their integration into a traditional library setting has created a pressing need for an automated tool that organizes textual information based on library classification schemes. Automated text classification is a research field concerned with developing tools, methods, and models to automate text classification. This article describes the current popular approach for text classification and major text classification projects and applications that are based on library classification schemes. Related issues and challenges are discussed, and a number of considerations for the challenges are examined.
    Date
    22. 9.2008 18:31:54
  9. Ma, Z.; Sun, A.; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter (2013) 0.01
    Relevance score: 0.012664106 (matched terms "29", "methods")
    
    Abstract
    Because of Twitter's popularity and the viral nature of information dissemination on Twitter, predicting which Twitter topics will become popular in the near future becomes a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., Naïve Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter data set consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs the best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features.
    Date
    25. 6.2013 19:05:29
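    The task described above is a standard supervised classification setup. A hypothetical scikit-learn sketch using logistic regression, the best-performing of the five models; the 18-column feature matrix and labels are synthetic stand-ins, not the paper's 7 content and 11 contextual features:

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import f1_score
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      X = rng.random((1000, 18))                    # 7 content + 11 contextual features
      y = (X[:, :3].sum(axis=1) > 1.5).astype(int)  # stand-in "became popular" label

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
      clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
      print(f1_score(y_te, clf.predict(X_te), average="micro"))  # Micro-F1, as in the study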
  10. Drori, O.; Alon, N.: Using document classification for displaying search results (2003) 0.01
    Relevance score: 0.012309102 (matched terms "29", "methods")
    
    Abstract
    In this paper, four self-developed user interfaces that display document search results using different methods were compared. In order to create the four interfaces, two information elements were used: document categories and lines from the document. A user study compared the four interfaces. It was found that the category addition to the interface was beneficial in both measurable and subjective measures. It was also found that displaying the relevant lines from the document increased the effectiveness and shortened the search time in all cases and tasks. It was found that the participants preferred the interface containing categories and relevant lines to all other interfaces checked. It was also the fastest in the objective time measurement. A further sub-study showed that the most important parameter for the users was the confidence level that the answer was accurate, and the least important parameter was the feeling of comfort while conducting a search.
    Source
    Journal of information science. 29(2003) no.2, S.97-106
  11. Liu, R.-L.: Context recognition for hierarchical text classification (2009) 0.01
    Relevance score: 0.012261091 (matched terms "methods", "22")
    
    Abstract
    Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
    Date
    22. 3.2009 19:11:54
  12. Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.01
    Relevance score: 0.012261091 (matched terms "methods", "22")
    
    Abstract
    This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in two key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
    Date
    4. 8.2015 19:22:04
  13. Aphinyanaphongs, Y.; Fu, L.D.; Li, Z.; Peskin, E.R.; Efstathiadis, E.; Aliferis, C.F.; Statnikov, A.: ¬A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization (2014) 0.01
    Relevance score: 0.0092228595 (matched term "methods")
    
    Abstract
    An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.
  14. Zhang, X.: Rough set theory based automatic text categorization (2005) 0.01
    Relevance score: 0.008612426 (matched term "theory")
    
    Abstract
    The research report "Rough Set Theory Based Automatic Text Categorization and the Handling of Semantic Heterogeneity" by Xueying Zhang has been published as a book in English. In her work, Zhang developed a method based on rough set theory that establishes relationships between the subject headings of different vocabularies. She was a staff member of the IZ from 2003 to 2005 and has been an Associate Professor at the Nanjing University of Science and Technology since October 2005.
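    For context, the standard rough set approximations that such methods build on (textbook definitions, not Zhang's specific construction): given an equivalence relation R on a universe U, a concept X ⊆ U is bracketed by its lower and upper approximations,

      \underline{R}X = \{\, x \in U : [x]_R \subseteq X \,\}, \qquad
      \overline{R}X = \{\, x \in U : [x]_R \cap X \neq \emptyset \,\}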
  15. Piros, A.: Automatic interpretation of complex UDC numbers : towards support for library systems (2015) 0.01
    Relevance score: 0.008206069 (matched terms "29", "methods")
    
    Abstract
    Analytico-synthetic and faceted classifications, such as the Universal Decimal Classification (UDC), express the content of documents with complex, pre-combined classification codes. Without classification authority control that would help manage and access structured notations, the use of UDC codes in searching and browsing is limited. Existing UDC parsing solutions are usually created for a particular database system or a specific task and are not widely applicable. The approach described in this paper provides a solution by which the analysis and interpretation of UDC notations are stored in an intermediate format (in this case, XML) by automatic means without any data or information loss. Due to its richness, the output file can be converted into different formats, such as standard mark-up and data exchange formats, or simple lists of the recommended entry points of a UDC number. The program can also be used to create authority records containing complex UDC numbers which can be comprehensively analysed in order to be retrieved effectively. The Java program, as well as the corresponding schema definition it employs, is under continuous development. The current version of the interpreter software is now available online for testing purposes at the following web site: http://interpreter-eto.rhcloud.com. The future plan is to implement conversion methods for standard formats and to create standard online interfaces in order to make it possible to use the features of the software as a service. This would allow the algorithm to be employed in both existing and future library systems to analyse UDC numbers without any significant programming effort.
    Source
    Classification and authority control: expanding resource discovery: proceedings of the International UDC Seminar 2015, 29-30 October 2015, Lisbon, Portugal. Eds.: Slavic, A. u. M.I. Cordeiro
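    As a rough picture of storing a parsed notation in XML, here is a toy sketch (not Piros's program; the element names and the crude splitting rule are invented for illustration):

      import re
      import xml.etree.ElementTree as ET

      def parse_udc(notation):
          """Split a pre-combined UDC notation on common connectives (:, +, /)
          and emit a minimal XML record. A real parser needs the full grammar."""
          root = ET.Element("udc", code=notation)
          for part in re.split(r"[:+/]", notation):
              ET.SubElement(root, "component").text = part
          return ET.tostring(root, encoding="unicode")

      print(parse_udc("004.8:025.4"))
      # <udc code="004.8:025.4"><component>004.8</component><component>025.4</component></udc>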
  16. Khoo, C.S.G.; Ng, K.; Ou, S.: An exploratory study of human clustering of Web pages (2003) 0.01
    Relevance score: 0.00817406 (matched terms "methods", "22")
    
    Abstract
    This study seeks to find out how human beings cluster Web pages naturally. Twenty Web pages retrieved by the Northern Light search engine for each of 10 queries were sorted by 3 subjects into categories that were natural or meaningful to them. It was found that different subjects clustered the same set of Web pages quite differently and created different categories. The average inter-subject similarity of the clusters created was a low 0.27. Subjects created an average of 5.4 clusters for each sorting. The categories constructed can be divided into 10 types. About 1/3 of the categories created were topical. Another 20% of the categories relate to the degree of relevance or usefulness. The rest of the categories were subject-independent categories such as format, purpose, authoritativeness and direction to other sources. The authors plan to develop automatic methods for categorizing Web pages using the common categories created by the subjects. It is hoped that the techniques developed can be used by Web search engines to automatically organize Web pages retrieved into categories that are natural to users.
    1. Introduction. The World Wide Web is an increasingly important source of information for people globally because of its ease of access, the ease of publishing, its ability to transcend geographic and national boundaries, its flexibility and heterogeneity and its dynamic nature. However, Web users also find it increasingly difficult to locate relevant and useful information in this vast information storehouse. Web search engines, despite their scope and power, appear to be quite ineffective. They retrieve too many pages, and though they attempt to rank retrieved pages in order of probable relevance, often the relevant documents do not appear in the top-ranked 10 or 20 documents displayed. Several studies have found that users do not know how to use the advanced features of Web search engines, and do not know how to formulate and re-formulate queries. Users also typically exert minimal effort in performing, evaluating and refining their searches, and are unwilling to scan more than 10 or 20 items retrieved (Jansen, Spink, Bateman & Saracevic, 1998). This suggests that the conventional ranked-list display of search results does not satisfy user requirements, and that better ways of presenting and summarizing search results have to be developed. One promising approach is to group retrieved pages into clusters or categories to allow users to navigate immediately to the "promising" clusters where the most useful Web pages are likely to be located. This approach has been adopted by a number of search engines (notably Northern Light) and search agents.
    Date
    12. 9.2004 9:56:22
  17. Yang, Y.; Liu, X.: A re-examination of text categorization methods (1999) 0.01
    Relevance score: 0.008133798 (matched term "methods")
    
    Abstract
    This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-Squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (fewer than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
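    A toy scikit-learn sketch of such a method comparison (three of the five classifiers, on an invented six-document corpus; purely illustrative and not the study's actual setup):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.svm import LinearSVC

      docs = ["grain exports rise", "wheat harvest up", "corn futures climb",
              "oil prices fall", "crude output steady", "petroleum demand dips"]
      labels = ["agri", "agri", "agri", "energy", "energy", "energy"]

      X = TfidfVectorizer().fit_transform(docs)
      for clf in (LinearSVC(), KNeighborsClassifier(n_neighbors=3), MultinomialNB()):
          clf.fit(X, labels)
          print(type(clf).__name__, clf.score(X, labels))  # training accuracy only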
  18. Panyr, J.: STEINADLER: ein Verfahren zur automatischen Deskribierung und zur automatischen thematischen Klassifikation (1978) 0.01
    Relevance score: 0.007116368 (matched term "29")
    
    Source
    Nachrichten für Dokumentation. 29(1978), S.92-96
  19. Schaalje, G.B.; Blades, N.J.; Funai, T.: An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.01
    Relevance score: 0.006037779 (matched term "methods")
    
    Abstract
    Recent studies of authorship attribution have used machine-learning methods including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier to identify likely authors of disputed texts. These methods are all limited by an inability to perform open-set classification and account for text and corpus size. We propose a customized Bayesian logit-normal-beta-binomial classification model for supervised authorship attribution. The model is based on the beta-binomial distribution with an explicit inverse relationship between extra-binomial variation and text size. The model internally estimates the relationship of extra-binomial variation to text size, and uses Markov Chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine-learning methods as well as the open-set Bayesian classifier on undisputed papers of The Federalist, and testing the method on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. The Bayesian classifier was the best classifier of these texts.
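    For reference, the beta-binomial distribution at the model's core, in its standard form (the paper's size adjustment tying extra-binomial variation to text length is not reproduced here): the probability of a count k out of n trials is

      P(k \mid n, \alpha, \beta) = \binom{n}{k}\,\frac{B(k+\alpha,\; n-k+\beta)}{B(\alpha,\beta)}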
  20. McKiernan, G.: Automated categorisation of Web resources : a profile of selected projects, research, products, and services (1996) 0.01
    Relevance score: 0.0058098556 (matched term "methods")
    
    Abstract
    Profiles several representative current efforts that apply established as well as more innovative methods of automated classification, organization, or other categorisation of WWW resources.

Languages

  • e 76
  • d 7
  • chi 1

Types

  • a 74
  • el 10
  • m 2
  • s 2
  • r 1
  • x 1