Search (45 results, page 1 of 3)

  • × year_i:[2010 TO 2020}
  • × theme_ss:"Automatisches Klassifizieren"
  1. HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.02
    0.022905817 = product of:
      0.045811635 = sum of:
        0.045811635 = product of:
          0.06871745 = sum of:
            0.0067215143 = weight(_text_:a in 2748) [ClassicSimilarity], result of:
              0.0067215143 = score(doc=2748,freq=2.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.12739488 = fieldWeight in 2748, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.078125 = fieldNorm(doc=2748)
            0.061995935 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
              0.061995935 = score(doc=2748,freq=2.0), product of:
                0.16023713 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.045758117 = queryNorm
                0.38690117 = fieldWeight in 2748, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=2748)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Date
    1. 2.2016 18:25:22
    Type
    a
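    The score breakdown above is Lucene ClassicSimilarity "explain" output: each matching term contributes queryWeight × fieldWeight, where queryWeight = idf × queryNorm and fieldWeight = tf × idf × fieldNorm with tf = sqrt(termFreq), and the sum is then scaled by the coord factors. A minimal Python sketch that reproduces the 0.022905817 total from the values in the tree:

      from math import sqrt

      def term_score(freq, idf, query_norm, field_norm):
          # ClassicSimilarity: queryWeight = idf * queryNorm,
          # fieldWeight = sqrt(freq) * idf * fieldNorm
          return (idf * query_norm) * (sqrt(freq) * idf * field_norm)

      QUERY_NORM = 0.045758117
      s = term_score(2.0, 1.153047, QUERY_NORM, 0.078125)    # _text_:a   -> 0.0067215143
      s += term_score(2.0, 3.5018296, QUERY_NORM, 0.078125)  # _text_:22  -> 0.061995935
      s *= 2 / 3   # coord(2/3): two of three query clauses matched
      s *= 1 / 2   # coord(1/2): one of two top-level clauses matched
      print(s)     # ~0.022905817
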
  2. Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.02
    0.016857736 = product of:
      0.03371547 = sum of:
        0.03371547 = product of:
          0.050573207 = sum of:
            0.013375646 = weight(_text_:a in 2158) [ClassicSimilarity], result of:
              0.013375646 = score(doc=2158,freq=22.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.25351265 = fieldWeight in 2158, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2158)
            0.03719756 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
              0.03719756 = score(doc=2158,freq=2.0), product of:
                0.16023713 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.045758117 = queryNorm
                0.23214069 = fieldWeight in 2158, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2158)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
    Date
    4. 8.2015 19:22:04
    Type
    a
  3. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.01
    0.014727588 = product of:
      0.029455176 = sum of:
        0.029455176 = product of:
          0.044182763 = sum of:
            0.006985203 = weight(_text_:a in 690) [ClassicSimilarity], result of:
              0.006985203 = score(doc=690,freq=6.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13239266 = fieldWeight in 690, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=690)
            0.03719756 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
              0.03719756 = score(doc=690,freq=2.0), product of:
                0.16023713 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.045758117 = queryNorm
                0.23214069 = fieldWeight in 690, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=690)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded in singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between the latent semantic indexing (LSI) term subspace and the LSI document subspace. LSISSM performs feature reduction by finding a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution-ranking mechanism in LSISSM also improves the initialization of standard K-means compared with the random seeding procedure, which sometimes causes inefficient and ineffective clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
    Date
    23. 3.2013 13:22:36
    Type
    a
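    As a hedged sketch of the general approach (not the authors' exact model), document signatures over the top-ranking latent dimensions can be obtained with a truncated SVD of the term-document matrix, and the most strongly contributing documents can then seed K-means in place of random seeding; names and parameters below are illustrative:

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.cluster import KMeans

      docs = ["..."]  # placeholder corpus
      X = TfidfVectorizer().fit_transform(docs)                    # sparse term-document matrix
      signatures = TruncatedSVD(n_components=50).fit_transform(X)  # low-rank document signatures

      # Two-stage initialization (illustrative): seed each cluster with the
      # document contributing most strongly to one of the leading dimensions.
      k = 5
      seeds = np.argmax(np.abs(signatures[:, :k]), axis=0)
      km = KMeans(n_clusters=k, init=signatures[seeds], n_init=1).fit(signatures)
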
  4. Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.01
    0.014213324 = product of:
      0.028426647 = sum of:
        0.028426647 = product of:
          0.04263997 = sum of:
            0.011642005 = weight(_text_:a in 1107) [ClassicSimilarity], result of:
              0.011642005 = score(doc=1107,freq=24.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.22065444 = fieldWeight in 1107, product of:
                  4.8989797 = tf(freq=24.0), with freq of:
                    24.0 = termFreq=24.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1107)
            0.030997967 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
              0.030997967 = score(doc=1107,freq=2.0), product of:
                0.16023713 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.045758117 = queryNorm
                0.19345059 = fieldWeight in 1107, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1107)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
    Date
    28.10.2013 19:22:57
    Type
    a
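    The passage-extraction idea is easy to sketch: split the text into candidate passages, let the underlying classifier score each one, and classify on the passage it is most confident about. A hypothetical outline, assuming an already trained scikit-learn-style classifier and vectorizer:

      def classify_best_passage(text, clf, vectorizer, window=3):
          """Classify a document by its most confidently scored passage."""
          sents = text.split(". ")  # naive sentence splitting, for illustration
          passages = [" ".join(sents[i:i + window])
                      for i in range(max(1, len(sents) - window + 1))]
          probs = clf.predict_proba(vectorizer.transform(passages))
          best = probs.max(axis=1).argmax()  # passage with the highest class confidence
          return clf.classes_[probs[best].argmax()], passages[best]
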
  5. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: ¬The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.01
    0.008929739 = product of:
      0.017859478 = sum of:
        0.017859478 = product of:
          0.026789214 = sum of:
            0.008065818 = weight(_text_:a in 3015) [ClassicSimilarity], result of:
              0.008065818 = score(doc=3015,freq=8.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.15287387 = fieldWeight in 3015, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3015)
            0.018723397 = weight(_text_:h in 3015) [ClassicSimilarity], result of:
              0.018723397 = score(doc=3015,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.16469726 = fieldWeight in 3015, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3015)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
    Type
    a
  6. Kasprzik, A.: Automatisierte und semiautomatisierte Klassifizierung : eine Analyse aktueller Projekte (2014) 0.01
    0.008142265 = product of:
      0.01628453 = sum of:
        0.01628453 = product of:
          0.024426792 = sum of:
            0.0057033943 = weight(_text_:a in 2470) [ClassicSimilarity], result of:
              0.0057033943 = score(doc=2470,freq=4.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.10809815 = fieldWeight in 2470, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2470)
            0.018723397 = weight(_text_:h in 2470) [ClassicSimilarity], result of:
              0.018723397 = score(doc=2470,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.16469726 = fieldWeight in 2470, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2470)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Source
    Perspektive Bibliothek. 3(2014) H.1, S.85-110
    Type
    a
  7. Suominen, A.; Toivanen, H.: Map of science with topic modeling : comparison of unsupervised learning and human-assigned subject classification (2016) 0.01
    0.007944992 = product of:
      0.015889984 = sum of:
        0.015889984 = product of:
          0.023834974 = sum of:
            0.008232141 = weight(_text_:a in 3121) [ClassicSimilarity], result of:
              0.008232141 = score(doc=3121,freq=12.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.15602624 = fieldWeight in 3121, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3121)
            0.015602832 = weight(_text_:h in 3121) [ClassicSimilarity], result of:
              0.015602832 = score(doc=3121,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13724773 = fieldWeight in 3121, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3121)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    The delineation of coordinates is fundamental for the cartography of science, and accurate and credible classification of scientific knowledge presents a persistent challenge in this regard. We present a map of Finnish science based on unsupervised-learning classification, and discuss the advantages and disadvantages of this approach vis-à-vis those generated by human reasoning. We conclude that from theoretical and practical perspectives there exist several challenges for human reasoning-based classification frameworks of scientific knowledge, as they typically try to fit new-to-the-world knowledge into historical models of scientific knowledge, and cannot easily be deployed for new large-scale data sets. Automated classification schemes, in contrast, generate classification models only from the available text corpus, thereby identifying credibly novel bodies of knowledge. They also lend themselves to versatile large-scale data analysis, and enable a range of Big Data possibilities. However, we also argue that it is neither possible nor fruitful to declare one or another method a superior approach in terms of realism to classify scientific knowledge, and we believe that the merits of each approach are dependent on the practical objectives of analysis.
    Type
    a
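    As a hedged illustration of the unsupervised side of such a comparison, a topic model can be fitted to publication abstracts, each document assigned to its dominant topic, and these machine-made classes cross-tabulated against the human-assigned subject categories:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      abstracts = ["..."]  # placeholder publication data
      X = CountVectorizer(stop_words="english", max_features=20000).fit_transform(abstracts)
      lda = LatentDirichletAllocation(n_components=60, random_state=0).fit(X)
      dominant_topic = lda.transform(X).argmax(axis=1)  # one unsupervised class per document
      # cross-tabulate dominant_topic against the human-assigned categories
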
  8. Qu, B.; Cong, G.; Li, C.; Sun, A.; Chen, H.: ¬An evaluation of classification models for question topic categorization (2012) 0.01
    0.0077059045 = product of:
      0.015411809 = sum of:
        0.015411809 = product of:
          0.023117714 = sum of:
            0.007514882 = weight(_text_:a in 237) [ClassicSimilarity], result of:
              0.007514882 = score(doc=237,freq=10.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.14243183 = fieldWeight in 237, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=237)
            0.015602832 = weight(_text_:h in 237) [ClassicSimilarity], result of:
              0.015602832 = score(doc=237,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13724773 = fieldWeight in 237, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=237)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification as well as on short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-word features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show which aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems.
    Type
    a
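    A minimal sketch of this kind of evaluation, comparing bag-of-words against word-n-gram features for two of the named standard classifiers (the CQA data and labels are placeholders):

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.svm import LinearSVC
      from sklearn.model_selection import cross_val_score

      questions, categories = ["..."], ["..."]  # placeholder CQA data
      for ngram_range in [(1, 1), (1, 2)]:      # bag-of-words vs. adding bigrams
          X = CountVectorizer(ngram_range=ngram_range).fit_transform(questions)
          for clf in (MultinomialNB(), LinearSVC()):
              acc = cross_val_score(clf, X, categories, cv=5).mean()
              print(ngram_range, type(clf).__name__, round(acc, 3))
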
  9. Fang, H.: Classifying research articles in multidisciplinary sciences journals into subject categories (2015) 0.01
    0.0077059045 = product of:
      0.015411809 = sum of:
        0.015411809 = product of:
          0.023117714 = sum of:
            0.007514882 = weight(_text_:a in 2194) [ClassicSimilarity], result of:
              0.007514882 = score(doc=2194,freq=10.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.14243183 = fieldWeight in 2194, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2194)
            0.015602832 = weight(_text_:h in 2194) [ClassicSimilarity], result of:
              0.015602832 = score(doc=2194,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13724773 = fieldWeight in 2194, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2194)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    In the Thomson Reuters Web of Science database, the subject categories of a journal are applied to all articles in the journal. However, many articles in multidisciplinary sciences journals may only be represented by a small number of subject categories. To provide more accurate information on the research areas of articles in such journals, we can classify articles in these journals into subject categories as defined by Web of Science based on their references. For an article in a multidisciplinary sciences journal, the method counts the subject categories of all of the article's references indexed by Web of Science and uses the most numerous subject categories of the references to determine the most appropriate classification of the article. We used articles in an issue of the Proceedings of the National Academy of Sciences (PNAS) to validate the correctness of the method by comparing the obtained results with the categories of the articles as defined by PNAS and their content. This study shows that the method provides more precise search results for the subject category of interest in bibliometric investigations through recognition of articles in multidisciplinary sciences journals whose work relates to a particular subject category.
    Type
    a
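    The method reduces to a frequency count over the subject categories of an article's indexed references; a minimal sketch (the category lookup itself would be a Web of Science query):

      from collections import Counter

      def classify_by_references(reference_categories):
          """reference_categories: one list of subject categories per
          reference of the article that is indexed by Web of Science."""
          counts = Counter(c for cats in reference_categories for c in cats)
          return counts.most_common(1)[0][0] if counts else None

      print(classify_by_references([["Neurosciences"], ["Neurosciences"], ["Genetics & Heredity"]]))
      # -> Neurosciences
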
  10. AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.01
    0.0077059045 = product of:
      0.015411809 = sum of:
        0.015411809 = product of:
          0.023117714 = sum of:
            0.007514882 = weight(_text_:a in 2836) [ClassicSimilarity], result of:
              0.007514882 = score(doc=2836,freq=10.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.14243183 = fieldWeight in 2836, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2836)
            0.015602832 = weight(_text_:h in 2836) [ClassicSimilarity], result of:
              0.015602832 = score(doc=2836,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13724773 = fieldWeight in 2836, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2836)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.
    Type
    a
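    The reported class relatedness can be probed by comparing class centroids in the 50-dimensional vector space under both distance measures; a hedged sketch, assuming doc_vectors holds the SVD representation and labels the authors' ACM classes:

      import numpy as np
      from scipy.spatial.distance import cosine, euclidean

      def class_distances(doc_vectors, labels, c1, c2):
          """Cosine and Euclidean distance between the centroids of two classes."""
          a = doc_vectors[labels == c1].mean(axis=0)
          b = doc_vectors[labels == c2].mean(axis=0)
          return cosine(a, b), euclidean(a, b)
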
  11. Wang, H.; Hong, M.: Supervised Hebb rule based feature selection for text classification (2019) 0.01
    0.0077059045 = product of:
      0.015411809 = sum of:
        0.015411809 = product of:
          0.023117714 = sum of:
            0.007514882 = weight(_text_:a in 5036) [ClassicSimilarity], result of:
              0.007514882 = score(doc=5036,freq=10.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.14243183 = fieldWeight in 5036, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5036)
            0.015602832 = weight(_text_:h in 5036) [ClassicSimilarity], result of:
              0.015602832 = score(doc=5036,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13724773 = fieldWeight in 5036, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5036)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    Text documents usually contain high-dimensional, non-discriminative (irrelevant and noisy) terms, which lead to steep computational costs and poor learning performance in text classification. One effective solution for this problem is feature selection, which aims to identify discriminative terms in text data. This paper proposes a method termed "Hebb rule based feature selection (HRFS)". HRFS is based on the supervised Hebb rule: it treats terms and classes as neurons and selects a term as discriminative if it keeps "exciting" the corresponding classes. This assumption follows the original Hebb postulate: a term is highly correlated with a class if it is able to keep "exciting" that class. Six benchmark datasets are used to compare HRFS with seven other feature selection methods. Experimental results indicate that HRFS achieves better performance than the compared methods. HRFS identifies discriminative terms in terms of the synapses between neurons. Moreover, HRFS is also efficient, because it can be expressed as matrix operations, which decreases the complexity of feature selection.
    Type
    a
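    A hedged sketch of the Hebbian intuition (not the authors' exact update): strengthen the synapse between a term neuron and a class neuron whenever the term fires in a document of that class, then keep the terms with the strongest synapses:

      import numpy as np

      def hebb_feature_selection(X, y, n_classes, top_k, lr=0.1):
          """X: binary (n_docs, n_terms) occurrence matrix; y: class ids.
          Returns indices of the top_k terms with the strongest synapses."""
          W = np.zeros((X.shape[1], n_classes))  # term-class synapse weights
          for x, cls in zip(X, y):
              W[:, cls] += lr * x                # Hebb: co-activation strengthens the synapse
          # equivalently one matrix operation: W = lr * X.T @ one_hot(y)
          strength = W.max(axis=1)               # how strongly a term "excites" its best class
          return np.argsort(-strength)[:top_k]
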
  12. Billal, B.; Fonseca, A.; Sadat, F.; Lounis, H.: Semi-supervised learning and social media text analysis towards multi-labeling categorization (2017) 0.01
    0.0071331207 = product of:
      0.014266241 = sum of:
        0.014266241 = product of:
          0.021399362 = sum of:
            0.008917097 = weight(_text_:a in 4095) [ClassicSimilarity], result of:
              0.008917097 = score(doc=4095,freq=22.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.16900843 = fieldWeight in 4095, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4095)
            0.012482265 = weight(_text_:h in 4095) [ClassicSimilarity], result of:
              0.012482265 = score(doc=4095,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.10979818 = fieldWeight in 4095, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4095)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    In traditional text classification, classes are mutually exclusive, i.e. it is not possible for one text or text fragment to be classified into more than one class. In multi-label classification, on the other hand, an individual text may belong to several classes simultaneously. This type of classification is required by a large number of current applications such as big data classification and image and video annotation. Supervised learning is the most used type of machine learning for the classification task. It requires large quantities of labeled data and the intervention of a human tagger in the creation of the training sets. When the data sets become very large or heavily noisy, this operation can be tedious, error-prone and time consuming. In this case, semi-supervised learning, which requires only a few labels, is a better choice. In this paper, we study and evaluate several methods to address the problem of multi-label classification using semi-supervised learning and data from social networks. First, we propose a linguistic pre-processing involving tokenisation, recognition of named entities and hashtag segmentation in order to decrease the noise in this type of massive and unstructured real data, and then we perform word sense disambiguation using WordNet. Second, several experiments related to multi-label classification and semi-supervised learning are carried out on these data sets and their results are compared to each other. This paper proposes a method that combines semi-supervised methods with a graph method for the extraction of subjects in social networks using a multi-label classification approach. Experiments show that the proposed model increases the precision of the classification by 4 percentage points compared to a baseline.
    Type
    a
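    One of the pre-processing steps named above, hashtag segmentation, can be sketched as dictionary-driven dynamic programming (the vocabulary is a placeholder; the authors' actual procedure may differ):

      def segment_hashtag(tag, vocab):
          """Split e.g. '#machinelearning' into ['machine', 'learning'];
          returns None if the tag cannot be fully segmented."""
          tag = tag.lstrip("#").lower()
          best = [None] * (len(tag) + 1)
          best[0] = []
          for i in range(1, len(tag) + 1):
              for j in range(i):
                  if best[j] is not None and tag[j:i] in vocab:
                      best[i] = best[j] + [tag[j:i]]
                      break
          return best[len(tag)]

      print(segment_hashtag("#machinelearning", {"machine", "learning"}))  # ['machine', 'learning']
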
  13. HaCohen-Kerner, Y.; Beck, H.; Yehudai, E.; Rosenstein, M.; Mughaz, D.: Cuisine : classification using stylistic feature sets and/or name-based feature sets (2010) 0.01
    0.0067852205 = product of:
      0.013570441 = sum of:
        0.013570441 = product of:
          0.02035566 = sum of:
            0.0047528287 = weight(_text_:a in 3706) [ClassicSimilarity], result of:
              0.0047528287 = score(doc=3706,freq=4.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.090081796 = fieldWeight in 3706, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3706)
            0.015602832 = weight(_text_:h in 3706) [ClassicSimilarity], result of:
              0.015602832 = score(doc=3706,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.13724773 = fieldWeight in 3706, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3706)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Abstract
    Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification UsIng Stylistic feature sets and/or NamE-based feature sets) achieves accuracy results between 90.71% and 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time&place). For the first six tasks, the stylistic feature sets in general, and the quantitative feature set in particular, are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds in significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.
    Type
    a
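    Quantitative stylistic features of the kind used here are cheap to compute and largely language- and domain-independent; an illustrative (not the paper's exact) feature set:

      def stylistic_features(text):
          """A few language-independent quantitative style features."""
          words = text.split()
          sents = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
          return {
              "avg_word_len": sum(map(len, words)) / max(1, len(words)),
              "avg_sent_len": len(words) / max(1, len(sents)),
              "type_token_ratio": len(set(words)) / max(1, len(words)),
              "digit_ratio": sum(ch.isdigit() for ch in text) / max(1, len(text)),
          }
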
  14. Groß, T.; Faden, M.: Automatische Indexierung elektronischer Dokumente an der Deutschen Zentralbibliothek für Wirtschaftswissenschaften : Bericht über die Jahrestagung der Internationalen Buchwissenschaftlichen Gesellschaft (2010) 0.01
    0.005056957 = product of:
      0.010113914 = sum of:
        0.010113914 = product of:
          0.01517087 = sum of:
            0.0026886058 = weight(_text_:a in 4051) [ClassicSimilarity], result of:
              0.0026886058 = score(doc=4051,freq=2.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.050957955 = fieldWeight in 4051, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4051)
            0.012482265 = weight(_text_:h in 4051) [ClassicSimilarity], result of:
              0.012482265 = score(doc=4051,freq=2.0), product of:
                0.113683715 = queryWeight, product of:
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.045758117 = queryNorm
                0.10979818 = fieldWeight in 4051, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.4844491 = idf(docFreq=10020, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4051)
          0.6666667 = coord(2/3)
      0.5 = coord(1/2)
    
    Source
    Bibliotheksdienst. 44(2010) H.12, S.1120-1135
    Type
    a
  15. Malo, P.; Sinha, A.; Wallenius, J.; Korhonen, P.: Concept-based document classification using Wikipedia and value function (2011) 0.00
    0.0020164545 = product of:
      0.004032909 = sum of:
        0.004032909 = product of:
          0.012098727 = sum of:
            0.012098727 = weight(_text_:a in 4948) [ClassicSimilarity], result of:
              0.012098727 = score(doc=4948,freq=18.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.22931081 = fieldWeight in 4948, product of:
                  4.2426405 = tf(freq=18.0), with freq of:
                    18.0 = termFreq=18.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4948)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Abstract
    In this article, we propose a new concept-based method for document classification. The conceptual knowledge associated with the words is drawn from Wikipedia. The purpose is to utilize the abundant semantic relatedness information available in Wikipedia in an efficient value function-based query learning algorithm. The procedure learns the value function by solving a simple linear programming problem formulated using the training documents. The learning involves a step-wise iterative process that helps in generating a value function with an appropriate set of concepts (dimensions) chosen from a collection of concepts. Once the value function is formulated, it is utilized to make a decision between relevance and irrelevance. The value assigned to a particular document from the value function can be further used to rank the documents according to their relevance. Reuters newswire documents have been used to evaluate the efficacy of the procedure. An extensive comparison with other frameworks has been performed. The results are promising.
    Type
    a
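    The value-function learning step can be hedged as a linear program: ask for concept weights that push relevant documents to a value of at least +1 and irrelevant ones to at most -1, minimizing the total slack (the formulation below is illustrative, not the authors' exact program):

      import numpy as np
      from scipy.optimize import linprog

      def learn_value_function(X_rel, X_irr):
          """X_rel/X_irr: (n, d) concept-feature matrices of relevant and
          irrelevant training documents. Returns a weight vector w."""
          d, n = X_rel.shape[1], len(X_rel) + len(X_irr)
          c = np.concatenate([np.zeros(d), np.ones(n)])  # minimize total slack
          A = np.zeros((n, d + n)); b = -np.ones(n)
          A[:len(X_rel), :d] = -X_rel                    # -w.x - slack <= -1  (relevant)
          A[len(X_rel):, :d] = X_irr                     #  w.x - slack <= -1  (irrelevant)
          A[np.arange(n), d + np.arange(n)] = -1.0
          res = linprog(c, A_ub=A, b_ub=b,
                        bounds=[(None, None)] * d + [(0, None)] * n)
          return res.x[:d]                               # value-function weights over concepts
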
  16. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.00
    0.0018577286 = product of:
      0.0037154572 = sum of:
        0.0037154572 = product of:
          0.011146371 = sum of:
            0.011146371 = weight(_text_:a in 4775) [ClassicSimilarity], result of:
              0.011146371 = score(doc=4775,freq=22.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.21126054 = fieldWeight in 4775, product of:
                  4.690416 = tf(freq=22.0), with freq of:
                    22.0 = termFreq=22.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4775)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Abstract
    The progressive increase of information content has recently made it necessary to create systems for the automatic classification of documents. In this article, a system is presented for the categorization of multiclass Farsi documents that requires fewer training examples and can help to compensate for the shortcomings of the standard training dataset. The new idea proposed in the present article is based on extending the feature vector by adding some words extracted from a thesaurus and then filtering the new feature vector by applying secondary feature selection to discard inappropriate features. In fact, a phase of secondary feature selection is applied to choose the more appropriate features among those added from the thesaurus, enhancing the effect of the thesaurus on the efficiency of the classifier. To evaluate the proposed system, a corpus was gathered from the Farsi Wikipedia website and from articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to studying the role of the thesaurus and applying secondary feature selection, the effects of varying the number of categories, the size of the training dataset, and the average number of words in the test data are also examined. As the results indicate, classification efficiency improves by applying this approach, especially when the available data is not sufficient for some text categories.
    Type
    a
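    The two-stage idea can be sketched as: expand each document with thesaurus synonyms of its words, then apply a secondary statistical filter to discard unhelpful additions (chi-square is used here for illustration; the paper's filter may differ):

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, chi2

      thesaurus = {"car": ["automobile"], "film": ["movie"]}  # placeholder thesaurus

      def expand(doc):
          words = doc.split()
          return " ".join(words + [syn for w in words for syn in thesaurus.get(w, [])])

      docs, labels = ["..."], ["..."]  # placeholder training data
      X = CountVectorizer().fit_transform(expand(d) for d in docs)
      X_selected = SelectKBest(chi2, k=1000).fit_transform(X, labels)  # secondary selection
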
  17. Borodin, Y.; Polishchuk, V.; Mahmud, J.; Ramakrishnan, I.V.; Stent, A.: Live and learn from mistakes : a lightweight system for document classification (2013) 0.00
    0.0017712747 = product of:
      0.0035425494 = sum of:
        0.0035425494 = product of:
          0.010627648 = sum of:
            0.010627648 = weight(_text_:a in 2722) [ClassicSimilarity], result of:
              0.010627648 = score(doc=2722,freq=20.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.20142901 = fieldWeight in 2722, product of:
                  4.472136 = tf(freq=20.0), with freq of:
                    20.0 = termFreq=20.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2722)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Abstract
    We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
    Type
    a
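    A hedged sketch of one competitive-learning step as described above: on a misclassification, the correct class's clusterhead is drawn toward the document, while a "leash" term keeps every clusterhead near its class centroid to prevent over-fitting:

      def update_clusterheads(heads, centroids, doc_vec, true_cls, pred_cls,
                              lr=0.1, leash=0.05):
          """heads, centroids: (n_classes, dim) numpy arrays; doc_vec: one document.
          Illustrative 3LM-style online update, not the authors' exact rule."""
          if pred_cls != true_cls:
              heads[true_cls] += lr * (doc_vec - heads[true_cls])  # pull toward the mistake
          heads += leash * (centroids - heads)                     # leash to the centroids
          return heads
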
  18. Cortez, E.; Herrera, M.R.; Silva, A.S. da; Moura, E.S. de; Neubert, M.: Lightweight methods for large-scale product categorization (2011) 0.00
    0.0016464281 = product of:
      0.0032928563 = sum of:
        0.0032928563 = product of:
          0.009878568 = sum of:
            0.009878568 = weight(_text_:a in 4758) [ClassicSimilarity], result of:
              0.009878568 = score(doc=4758,freq=12.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.18723148 = fieldWeight in 4758, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4758)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Abstract
    In this article, we present a study of classification methods for the large-scale categorization of product offers on e-shopping web sites. We examine the performance of previously proposed approaches and deploy a probabilistic approach to model the classification problem. We also study an alternative way of modeling information about the description of product offers and investigate the use of the price and store of a product offer as features in the classification process. Our experiments used two collections of over a million product offers previously categorized by human editors and taxonomies of hundreds of categories from a real e-shopping web site. In these experiments, our method achieved an improvement of up to 9% in the quality of the categorization in comparison with the best baseline we have found.
    Type
    a
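    One lightweight way to combine the offer description with price and store signals in a probabilistic classifier, in the spirit of the study above (the feature design is illustrative):

      import pandas as pd
      from sklearn.compose import ColumnTransformer
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

      offers = pd.DataFrame({"title": ["..."], "price": [9.99], "store": ["shopA"]})  # placeholder
      features = ColumnTransformer([
          ("text", TfidfVectorizer(), "title"),
          ("price", KBinsDiscretizer(n_bins=10, encode="onehot"), ["price"]),
          ("store", OneHotEncoder(handle_unknown="ignore"), ["store"]),
      ])
      model = make_pipeline(features, MultinomialNB())
      # model.fit(offers, categories) with a large labeled offer collection
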
  19. Barbu, E.: What kind of knowledge is in Wikipedia? : unsupervised extraction of properties for similar concepts (2014) 0.00
    0.0016464281 = product of:
      0.0032928563 = sum of:
        0.0032928563 = product of:
          0.009878568 = sum of:
            0.009878568 = weight(_text_:a in 1547) [ClassicSimilarity], result of:
              0.009878568 = score(doc=1547,freq=12.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.18723148 = fieldWeight in 1547, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1547)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Abstract
    This article presents a novel method for extracting knowledge from Wikipedia and a classification schema for annotating the extracted knowledge. Unlike the majority of approaches in the literature, we use the raw Wikipedia text for knowledge acquisition. The main assumption made is that the concepts classified under the same node in a taxonomy are described in a comparable way in Wikipedia. The annotation of the extracted knowledge is done at two levels: ontological and logical. The extracted properties are evaluated in the traditional way, that is, by computing the precision of the extraction procedure and in a clustering task. The second method of evaluation is seldom used in the natural language processing community, but it is regularly employed in cognitive psychology.
    Type
    a
  20. Ko, Y.: ¬A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.00
    0.0016464281 = product of:
      0.0032928563 = sum of:
        0.0032928563 = product of:
          0.009878568 = sum of:
            0.009878568 = weight(_text_:a in 2339) [ClassicSimilarity], result of:
              0.009878568 = score(doc=2339,freq=12.0), product of:
                0.052761257 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.045758117 = queryNorm
                0.18723148 = fieldWeight in 2339, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2339)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Abstract
    Text classification (TC) is a core technique for text mining and information retrieval, and it has been applied in many research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain high TC performance. Although term weighting is one of the important modules for TC, and TC has peculiarities that differ from those of information retrieval, many term-weighting schemes from information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC unchanged. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that exploits class information via the positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than other schemes that use class information as well as traditional schemes such as tf-idf.
    Type
    a
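    As a hedged reading of the scheme (the paper's exact smoothing may differ), the weight combines a log-scaled term frequency with the log odds of the term occurring in the positive versus the negative class:

      import math

      def log_tf_trr(tf, df_pos, df_neg, n_pos, n_neg):
          """Sketch of a log(tf) x term-relevance-ratio weight; TRR compares
          the term's probability in the positive and negative classes
          (add-one smoothing assumed here for illustration)."""
          p_pos = (df_pos + 1) / (n_pos + 1)  # P(term | positive class)
          p_neg = (df_neg + 1) / (n_neg + 1)  # P(term | negative class)
          return math.log(1 + tf) * math.log(2 + p_pos / p_neg)
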

Languages

  • e 43
  • d 2

Types

  • a 43
  • el 2
  • s 1