Search (66 results, page 1 of 4)

  • theme_ss:"Automatisches Klassifizieren"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.17
    0.17408767 = sum of:
      0.082819656 = product of:
        0.24845897 = sum of:
          0.24845897 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
            0.24845897 = score(doc=562,freq=2.0), product of:
              0.44208363 = queryWeight, product of:
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.052144732 = queryNorm
              0.56201804 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.33333334 = coord(1/3)
      0.09126801 = sum of:
        0.048878662 = weight(_text_:data in 562) [ClassicSimilarity], result of:
          0.048878662 = score(doc=562,freq=4.0), product of:
            0.16488427 = queryWeight, product of:
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.052144732 = queryNorm
            0.29644224 = fieldWeight in 562, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1620505 = idf(docFreq=5088, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.04238935 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.04238935 = score(doc=562,freq=2.0), product of:
            0.18260197 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.052144732 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
    
    Content
     Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
    Source
    Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK
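
The indented score breakdowns attached to each hit are Lucene ClassicSimilarity explain trees. As a reading aid, the sketch below recomputes the 0.17408767 score of hit 1 purely from the factors printed in that tree; the idf, queryNorm, and fieldNorm values are taken as given from the explain output, and nothing about the underlying index is assumed beyond what the tree shows.

```python
from math import isclose, sqrt

# Factors copied verbatim from the explain tree of hit 1 (doc 562).
QUERY_NORM = 0.052144732
FIELD_NORM = 0.046875            # fieldNorm(doc=562)

def clause_score(freq, idf):
    """ClassicSimilarity per-term score = queryWeight * fieldWeight."""
    tf = sqrt(freq)                          # tf(freq) = sqrt(termFreq)
    query_weight = idf * QUERY_NORM          # queryWeight = idf * queryNorm
    field_weight = tf * idf * FIELD_NORM     # fieldWeight = tf * idf * fieldNorm
    return query_weight * field_weight

part_3a   = clause_score(freq=2.0, idf=8.478011) * (1 / 3)   # coord(1/3)
part_data = clause_score(freq=4.0, idf=3.1620505)
part_22   = clause_score(freq=2.0, idf=3.5018296)

total = part_3a + part_data + part_22
print(total)                                  # ~0.17408767
assert isclose(total, 0.17408767, rel_tol=1e-5)
```

The coord(1/3) factor indicates that only one of three clauses in the first sub-query matched document 562; in the second sub-query both clauses ("data", "22") matched, so its coord factor of 1 is simply not printed.
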
  2. HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.06
    0.064126484 = product of:
      0.12825297 = sum of:
        0.12825297 = sum of:
          0.057604056 = weight(_text_:data in 2748) [ClassicSimilarity], result of:
            0.057604056 = score(doc=2748,freq=2.0), product of:
              0.16488427 = queryWeight, product of:
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.052144732 = queryNorm
              0.34936053 = fieldWeight in 2748, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.078125 = fieldNorm(doc=2748)
          0.070648916 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
            0.070648916 = score(doc=2748,freq=2.0), product of:
              0.18260197 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052144732 = queryNorm
              0.38690117 = fieldWeight in 2748, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
    
    Date
    1. 2.2016 18:25:22
    Source
    Semantic keyword-based search on structured data sources: First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers. Eds.: J. Cardoso et al
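
The idf values recurring in these explain trees are consistent with the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). The short check below compares that formula with the three term statistics visible in hits 1 and 2; reading the printed maxDocs=44218 as the document count is an assumption based on how the explain output labels it.

```python
from math import log

MAX_DOCS = 44218   # printed as maxDocs in every explain tree above

def classic_idf(doc_freq, num_docs=MAX_DOCS):
    """Lucene ClassicSimilarity idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + log(num_docs / (doc_freq + 1))

for term, doc_freq, shown in [("3a", 24, 8.478011),
                              ("data", 5088, 3.1620505),
                              ("22", 3622, 3.5018296)]:
    print(f"{term:>4}: computed {classic_idf(doc_freq):.7f}  vs shown {shown}")
```
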
  3. Dubin, D.: Dimensions and discriminability (1998) 0.04
    0.04488854 = product of:
      0.08977708 = sum of:
        0.08977708 = sum of:
          0.040322836 = weight(_text_:data in 2338) [ClassicSimilarity], result of:
            0.040322836 = score(doc=2338,freq=2.0), product of:
              0.16488427 = queryWeight, product of:
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.052144732 = queryNorm
              0.24455236 = fieldWeight in 2338, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2338)
          0.049454242 = weight(_text_:22 in 2338) [ClassicSimilarity], result of:
            0.049454242 = score(doc=2338,freq=2.0), product of:
              0.18260197 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052144732 = queryNorm
              0.2708308 = fieldWeight in 2338, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2338)
      0.5 = coord(1/2)
    
    Date
    22. 9.1997 19:16:05
    Source
    Visualizing subject access for 21st century information resources: Papers presented at the 1997 Clinic on Library Applications of Data Processing, 2-4 Mar 1997, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Ed.: P.A. Cochrane et al
  4. Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.04
    0.04488854 = product of:
      0.08977708 = sum of:
        0.08977708 = sum of:
          0.040322836 = weight(_text_:data in 5273) [ClassicSimilarity], result of:
            0.040322836 = score(doc=5273,freq=2.0), product of:
              0.16488427 = queryWeight, product of:
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.052144732 = queryNorm
              0.24455236 = fieldWeight in 5273, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5273)
          0.049454242 = weight(_text_:22 in 5273) [ClassicSimilarity], result of:
            0.049454242 = score(doc=5273,freq=2.0), product of:
              0.18260197 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052144732 = queryNorm
              0.2708308 = fieldWeight in 5273, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0546875 = fieldNorm(doc=5273)
      0.5 = coord(1/2)
    
    Abstract
     In text categorization tasks, classification on some class hierarchies yields better results than classification without the hierarchy. Currently, because a large number of documents are divided into several subgroups in a hierarchy, we can appropriately use a hierarchical classification method. However, we have no systematic method to build a hierarchical classification system that performs well with large collections of practical data. In this article, we introduce a new evaluation scheme for internal node classifiers, which can be used effectively to develop a hierarchical classification system. We also show that our method for constructing the hierarchical classification system is very effective, especially for the task of constructing classifiers applied to a hierarchy tree with many levels.
    Date
    22. 7.2006 16:24:52
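
Hit 4 revolves around classifiers placed at the internal nodes of a class hierarchy. As a minimal illustration of top-down routing through such internal-node classifiers (not the evaluation scheme proposed in the paper), here is a toy two-level sketch; the hierarchy, documents, and labels are invented for the example, and scikit-learn is assumed to be available.

```python
# Toy top-down hierarchical classifier with one model per internal node.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented two-level hierarchy: root -> {science, sports}, science -> {physics, biology}.
train = [
    ("quarks and gluons in particle colliders", "science", "physics"),
    ("gene expression in living cells",         "science", "biology"),
    ("the striker scored twice in the final",   "sports",  "sports"),
]

root_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
    [text for text, top, _ in train], [top for _, top, _ in train])
science_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
    [text for text, top, _ in train if top == "science"],
    [leaf for _, top, leaf in train if top == "science"])

def classify(doc):
    top = root_clf.predict([doc])[0]            # decision at the root node
    if top == "science":
        return science_clf.predict([doc])[0]    # decision at the internal node
    return top

print(classify("colliders accelerate particles to high energies"))
```
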
  5. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.03
    0.032063242 = product of:
      0.064126484 = sum of:
        0.064126484 = sum of:
          0.028802028 = weight(_text_:data in 2765) [ClassicSimilarity], result of:
            0.028802028 = score(doc=2765,freq=2.0), product of:
              0.16488427 = queryWeight, product of:
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.052144732 = queryNorm
              0.17468026 = fieldWeight in 2765, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.1620505 = idf(docFreq=5088, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
          0.035324458 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
            0.035324458 = score(doc=2765,freq=2.0), product of:
              0.18260197 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.052144732 = queryNorm
              0.19345059 = fieldWeight in 2765, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
    
    Abstract
    Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.
    Date
    22. 3.2009 19:14:43
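
Hit 5 distinguishes passage retrieval (ranking passages against a query) from passage detection (classifying passages into predetermined categories). The sketch below illustrates only that task framing with a fixed-size splitter and an off-the-shelf classifier; it is not the KDP approach, and the categories, training passages, and window size are arbitrary choices for the example.

```python
# Passage detection as classification: split a document into passages and
# label each one, instead of ranking passages against a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

labelled_passages = [
    ("quarterly revenue and profit figures for the group",   "finance"),
    ("routing tables and packet forwarding on the backbone", "networking"),
    ("player transfers completed before the deadline",       "sports"),
]
clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(
    [p for p, _ in labelled_passages], [c for _, c in labelled_passages])

def split_into_passages(document, size=8):
    """Fixed-size word windows; one of many possible splitting strategies."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

document = ("the annual report discusses revenue and profit figures "
            "but a hidden paragraph lists player transfers before the deadline")
for passage in split_into_passages(document):
    print(clf.predict([passage])[0], "<-", passage)
```
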
  6. Wu, M.; Liu, Y.-H.; Brownlee, R.; Zhang, X.: Evaluating utility and automatic classification of subject metadata from Research Data Australia (2021) 0.02
    0.024439331 = product of:
      0.048878662 = sum of:
        0.048878662 = product of:
          0.097757325 = sum of:
            0.097757325 = weight(_text_:data in 453) [ClassicSimilarity], result of:
              0.097757325 = score(doc=453,freq=16.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.5928845 = fieldWeight in 453, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.046875 = fieldNorm(doc=453)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well represented categories of subject metadata, and these categories tend to have features that can distinguish themselves from the other categories. Our findings raise implications for data catalogue providers; they should invest more effort to enhance the quality of data records by providing an adequate description of these records for under-represented subject categories.
  7. Subramanian, S.; Shafer, K.E.: Clustering (2001) 0.02
    0.021194674 = product of:
      0.04238935 = sum of:
        0.04238935 = product of:
          0.0847787 = sum of:
            0.0847787 = weight(_text_:22 in 1046) [ClassicSimilarity], result of:
              0.0847787 = score(doc=1046,freq=2.0), product of:
                0.18260197 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.052144732 = queryNorm
                0.46428138 = fieldWeight in 1046, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=1046)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    5. 5.2003 14:17:22
  8. Fong, A.C.M.: Mining a Web citation database for document clustering (2002) 0.02
    0.020161418 = product of:
      0.040322836 = sum of:
        0.040322836 = product of:
          0.08064567 = sum of:
            0.08064567 = weight(_text_:data in 3940) [ClassicSimilarity], result of:
              0.08064567 = score(doc=3940,freq=2.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.48910472 = fieldWeight in 3940, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3940)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Theme
    Data Mining
  9. Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.02
    0.020161418 = product of:
      0.040322836 = sum of:
        0.040322836 = product of:
          0.08064567 = sum of:
            0.08064567 = weight(_text_:data in 724) [ClassicSimilarity], result of:
              0.08064567 = score(doc=724,freq=8.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.48910472 = fieldWeight in 724, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=724)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The Wikidata gadget, CCLitBox, for the automated classification of literary authors and works by a faceted classification and using Linked Open Data (LOD) is presented. The tool reproduces the classification algorithm of class O Literature of the Colon Classification and uses data freely available in Wikidata to create Colon Classification class numbers. CCLitBox is totally free and enables any user to classify literary authors and their works; it is easily accessible to everybody; it uses LOD from Wikidata but missing data for classification can be freely added if necessary; it is readymade for any cooperative and networked project.
  10. Classification, automation, and new media : Proceedings of the 24th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Passau, March 15 - 17, 2000 (2002) 0.02
    0.01905075 = product of:
      0.0381015 = sum of:
        0.0381015 = product of:
          0.076203 = sum of:
            0.076203 = weight(_text_:data in 5997) [ClassicSimilarity], result of:
              0.076203 = score(doc=5997,freq=14.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.46216056 = fieldWeight in 5997, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5997)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Given the huge amount of information in the internet and in practically every domain of knowledge that we are facing today, knowledge discovery calls for automation. The book deals with methods from classification and data analysis that respond effectively to this rapidly growing challenge. The interested reader will find new methodological insights as well as applications in economics, management science, finance, and marketing, and in pattern recognition, biology, health, and archaeology.
    Content
    Data Analysis, Statistics, and Classification.- Pattern Recognition and Automation.- Data Mining, Information Processing, and Automation.- New Media, Web Mining, and Automation.- Applications in Management Science, Finance, and Marketing.- Applications in Medicine, Biology, Archaeology, and Others.- Author Index.- Subject Index.
    RSWK
    Data Mining / Kongress / Passau <2000>
    Series
     Proceedings of the ... annual conference of the Gesellschaft für Klassifikation e.V. ; 24 (Studies in classification, data analysis, and knowledge organization)
    Subject
    Data Mining / Kongress / Passau <2000>
    Theme
    Data Mining
  11. Reiner, U.: Automatische DDC-Klassifizierung von bibliografischen Titeldatensätzen (2009) 0.02
    0.017662229 = product of:
      0.035324458 = sum of:
        0.035324458 = product of:
          0.070648916 = sum of:
            0.070648916 = weight(_text_:22 in 611) [ClassicSimilarity], result of:
              0.070648916 = score(doc=611,freq=2.0), product of:
                0.18260197 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.052144732 = queryNorm
                0.38690117 = fieldWeight in 611, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.078125 = fieldNorm(doc=611)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    22. 8.2009 12:54:24
  12. Autonomy, Inc.: Automatic classification (o.J.) 0.02
    0.016292887 = product of:
      0.032585774 = sum of:
        0.032585774 = product of:
          0.06517155 = sum of:
            0.06517155 = weight(_text_:data in 1666) [ClassicSimilarity], result of:
              0.06517155 = score(doc=1666,freq=4.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.3952563 = fieldWeight in 1666, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0625 = fieldNorm(doc=1666)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Autonomy's Classification solutions remove the necessity for organizations to rely on human intervention or manual processing of information, such as manual tagging, typically required to make most other e-business applications work. Autonomy's ability to consistently and accurately classify data automatically is a unique infrastructure solution that overcomes the predicaments surrounding the exponential growth of unstructured data.
  13. Billal, B.; Fonseca, A.; Sadat, F.; Lounis, H.: Semi-supervised learning and social media text analysis towards multi-labeling categorization (2017) 0.02
    0.016292887 = product of:
      0.032585774 = sum of:
        0.032585774 = product of:
          0.06517155 = sum of:
            0.06517155 = weight(_text_:data in 4095) [ClassicSimilarity], result of:
              0.06517155 = score(doc=4095,freq=16.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.3952563 = fieldWeight in 4095, product of:
                  4.0 = tf(freq=16.0), with freq of:
                    16.0 = termFreq=16.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.03125 = fieldNorm(doc=4095)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     In traditional text classification, classes are mutually exclusive, i.e. it is not possible to have one text or text fragment classified into more than one class. On the other hand, in multi-label classification an individual text may belong to several classes simultaneously. This type of classification is required by a large number of current applications such as big data classification, images and video annotation. Supervised learning is the most used type of machine learning in the classification task. It requires large quantities of labeled data and the intervention of a human tagger in the creation of the training sets. When the data sets become very large or heavily noisy, this operation can be tedious, prone to error and time consuming. In this case, semi-supervised learning, which requires only a few labels, is a better choice. In this paper, we study and evaluate several methods to address the problem of multi-label classification using semi-supervised learning and data from social networks. First, we propose a linguistic pre-processing involving tokenisation, recognition of named entities and hashtag segmentation in order to decrease the noise in this type of massive and unstructured real data and then we perform a word sense disambiguation using WordNet. Second, several experiments related to multi-label classification and semi-supervised learning are carried out on these data sets and compared to each other. These evaluations compare the results of the approaches considered. This paper proposes a method for combining semi-supervised methods with a graph method for the extraction of subjects in social networks using a multi-label classification approach. Experiments show that the proposed model increases the precision of the classification by 4 percentage points when compared to a baseline.
    Source
    IEEE International Conference on Big Data (Big Data) (2017)
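
Hit 13 combines multi-label classification with semi-supervised learning on social-media text. The following sketch shows the general shape of such a pipeline, with one round of confidence-thresholded pseudo-labelling standing in for the paper's semi-supervised component; the texts, label set, and 0.6 threshold are illustrative assumptions, not values from the paper.

```python
# Multi-label text classification with one simple pseudo-labelling round.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

labelled = [("new phone camera and battery review",  ["tech"]),
            ("election results and new tech policy", ["politics", "tech"]),
            ("parliament debates the budget",        ["politics"])]
unlabelled = ["the senate votes on the budget bill",
              "hands-on with the new laptop battery"]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([labels for _, labels in labelled])
vec = TfidfVectorizer()
X = vec.fit_transform([text for text, _ in labelled])
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Pseudo-label confident predictions on the unlabelled pool and retrain once.
probs = clf.predict_proba(vec.transform(unlabelled))
pseudo = (probs > 0.6).astype(int)                 # per-label confidence threshold
keep = pseudo.sum(axis=1) > 0                      # rows that got at least one label
if keep.any():
    X_aug = vec.transform([t for t, _ in labelled] + list(np.array(unlabelled)[keep]))
    Y_aug = np.vstack([Y, pseudo[keep]])
    clf = OneVsRestClassifier(LogisticRegression()).fit(X_aug, Y_aug)

print(mlb.inverse_transform(clf.predict(vec.transform(["budget vote in parliament"]))))
```
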
  14. Salles, T.; Rocha, L.; Gonçalves, M.A.; Almeida, J.M.; Mourão, F.; Meira Jr., W.; Viegas, F.: A quantitative analysis of the temporal effects on automatic text classification (2016) 0.02
    0.016100824 = product of:
      0.032201648 = sum of:
        0.032201648 = product of:
          0.064403296 = sum of:
            0.064403296 = weight(_text_:data in 3014) [ClassicSimilarity], result of:
              0.064403296 = score(doc=3014,freq=10.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.39059696 = fieldWeight in 3014, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3014)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.
  15. Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.01
    0.014965973 = product of:
      0.029931946 = sum of:
        0.029931946 = product of:
          0.05986389 = sum of:
            0.05986389 = weight(_text_:data in 3464) [ClassicSimilarity], result of:
              0.05986389 = score(doc=3464,freq=6.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.3630661 = fieldWeight in 3464, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3464)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and they are employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods, and they were cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
    Theme
    Data Mining
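
Hit 15 clusters journals by fusing a text view with a bibliometric view. A minimal version of that idea, combining two similarity matrices with a fixed mixing weight before spectral clustering, is sketched below; the fixed weight w stands in for the paper's information-based weighting scheme, and all data are toy values.

```python
# Hybrid clustering from a weighted combination of two similarity views.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

journal_texts = ["protein folding and molecular biology",
                 "gene regulation in living cells",
                 "graph algorithms and computational complexity",
                 "approximation algorithms for combinatorial optimisation"]
# Toy symmetric citation counts between the same four journals.
citations = np.array([[0, 5, 0, 0],
                      [5, 0, 1, 0],
                      [0, 1, 0, 4],
                      [0, 0, 4, 0]], dtype=float)

text_sim = cosine_similarity(TfidfVectorizer().fit_transform(journal_texts))
cite_sim = citations / citations.max()            # scale to [0, 1]

w = 0.5                                           # assumed fixed mixing weight
hybrid = w * text_sim + (1 - w) * cite_sim
np.fill_diagonal(hybrid, 1.0)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(hybrid)
print(labels)                                     # e.g. [0 0 1 1]
```
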
  16. Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.01
    0.014965973 = product of:
      0.029931946 = sum of:
        0.029931946 = product of:
          0.05986389 = sum of:
            0.05986389 = weight(_text_:data in 3015) [ClassicSimilarity], result of:
              0.05986389 = score(doc=3015,freq=6.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.3630661 = fieldWeight in 3015, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3015)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
    Theme
    Data Mining
  17. Ru, C.; Tang, J.; Li, S.; Xie, S.; Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction (2018) 0.01
    0.014401014 = product of:
      0.028802028 = sum of:
        0.028802028 = product of:
          0.057604056 = sum of:
            0.057604056 = weight(_text_:data in 5055) [ClassicSimilarity], result of:
              0.057604056 = score(doc=5055,freq=8.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.34936053 = fieldWeight in 5055, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5055)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Distant supervision (DS) has the advantage of automatically generating large amounts of labelled training data and has been widely used for relation extraction. However, there are usually many wrong labels in the automatically labelled data in distant supervision (Riedel, Yao, & McCallum, 2010). This paper presents a novel method to reduce the wrong labels. The proposed method uses the semantic Jaccard with word embedding to measure the semantic similarity between the relation phrase in the knowledge base and the dependency phrases between two entities in a sentence to filter the wrong labels. In the process of reducing wrong labels, the semantic Jaccard algorithm selects a core dependency phrase to represent the candidate relation in a sentence, which can capture features for relation classification and avoid the negative impact from irrelevant term sequences that previous neural network models of relation extraction often suffer. In the process of relation classification, the core dependency phrases are also used as the input of a convolutional neural network (CNN) for relation classification. The experimental results show that compared with the methods using original DS data, the methods using filtered DS data performed much better in relation extraction. It indicates that the semantic similarity based method is effective in reducing wrong labels. The relation extraction performance of the CNN model using the core dependency phrases as input is the best of all, which indicates that using the core dependency phrases as input of CNN is enough to capture the features for relation classification and could avoid negative impact from irrelevant terms.
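
Hit 17 filters wrongly labelled distant-supervision instances by measuring semantic similarity between the knowledge-base relation phrase and the dependency phrase found in the sentence. The sketch below uses cosine similarity over averaged word vectors as a stand-in for the paper's semantic Jaccard measure; the toy vectors, phrases, and threshold are all hypothetical.

```python
# Keep or drop distantly supervised instances by phrase-level similarity.
import numpy as np

toy_vectors = {                      # hypothetical 3-d word embeddings
    "founded":     np.array([0.90, 0.10, 0.00]),
    "established": np.array([0.85, 0.20, 0.00]),
    "visited":     np.array([0.10, 0.10, 0.90]),
    "company":     np.array([0.30, 0.80, 0.10]),
}

def phrase_vector(phrase):
    vecs = [toy_vectors[w] for w in phrase.split() if w in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def similarity(a, b):
    va, vb = phrase_vector(a), phrase_vector(b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

kb_relation = "founded company"                  # phrase from the knowledge base
for dep_phrase in ["established company", "visited company"]:
    keep = similarity(kb_relation, dep_phrase) >= 0.9   # illustrative threshold
    print(dep_phrase, "->", "keep" if keep else "drop as likely wrong label")
```
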
  18. Han, K.; Rezapour, R.; Nakamura, K.; Devkota, D.; Miller, D.C.; Diesner, J.: ¬An expert-in-the-loop method for domain-specific document categorization based on small training data (2023) 0.01
    0.014401014 = product of:
      0.028802028 = sum of:
        0.028802028 = product of:
          0.057604056 = sum of:
            0.057604056 = weight(_text_:data in 967) [ClassicSimilarity], result of:
              0.057604056 = score(doc=967,freq=8.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.34936053 = fieldWeight in 967, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=967)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.
  19. Yang, P.; Gao, W.; Tan, Q.; Wong, K.-F.: A link-bridged topic model for cross-domain document classification (2013) 0.01
    0.012471643 = product of:
      0.024943287 = sum of:
        0.024943287 = product of:
          0.049886573 = sum of:
            0.049886573 = weight(_text_:data in 2706) [ClassicSimilarity], result of:
              0.049886573 = score(doc=2706,freq=6.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.30255508 = fieldWeight in 2706, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2706)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Transfer learning utilizes labeled data available from some related domain (source domain) for achieving effective knowledge transformation to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationship existing among the documents. In this paper, we propose a novel cross-domain document classification approach called Link-Bridged Topic model (LBT). LBT consists of two key steps. Firstly, LBT utilizes an auxiliary link network to discover the direct or indirect co-citation relationship among documents by embedding the background knowledge into a graph kernel. The mined co-citation relationship is leveraged to bridge the gap across different domains. Secondly, LBT simultaneously combines the content information and link structures into a unified latent topic model. The model is based on an assumption that the documents of source and target domains share some common topics from the point of view of both content information and link structure. By mapping both domains data into the latent topic spaces, LBT encodes the knowledge about domain commonality and difference as the shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as content and link statistics. Then the shared topics act as the bridge to facilitate knowledge transfer from the source to the target domains. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.
  20. Suominen, A.; Toivanen, H.: Map of science with topic modeling : comparison of unsupervised learning and human-assigned subject classification (2016) 0.01
    0.012471643 = product of:
      0.024943287 = sum of:
        0.024943287 = product of:
          0.049886573 = sum of:
            0.049886573 = weight(_text_:data in 3121) [ClassicSimilarity], result of:
              0.049886573 = score(doc=3121,freq=6.0), product of:
                0.16488427 = queryWeight, product of:
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.052144732 = queryNorm
                0.30255508 = fieldWeight in 3121, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  3.1620505 = idf(docFreq=5088, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3121)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The delineation of coordinates is fundamental for the cartography of science, and accurate and credible classification of scientific knowledge presents a persistent challenge in this regard. We present a map of Finnish science based on unsupervised-learning classification, and discuss the advantages and disadvantages of this approach vis-à-vis those generated by human reasoning. We conclude that from theoretical and practical perspectives there exist several challenges for human reasoning-based classification frameworks of scientific knowledge, as they typically try to fit new-to-the-world knowledge into historical models of scientific knowledge, and cannot easily be deployed for new large-scale data sets. Automated classification schemes, in contrast, generate classification models only from the available text corpus, thereby identifying credibly novel bodies of knowledge. They also lend themselves to versatile large-scale data analysis, and enable a range of Big Data possibilities. However, we also argue that it is neither possible nor fruitful to declare one or another method a superior approach in terms of realism to classify scientific knowledge, and we believe that the merits of each approach are dependent on the practical objectives of analysis.
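
Hit 20 compares human-assigned subject classification with unsupervised topic modelling. A minimal version of the unsupervised side, fitting LDA on a handful of abstracts and treating the dominant topic as the assigned class, might look like the following; the corpus, topic count, and interpretation are assumptions made for the example.

```python
# Unsupervised "classification" via topic modelling with LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "protein structure prediction and folding dynamics",
    "gene expression profiles in cancer cells",
    "deep neural networks for image recognition",
    "convolutional networks trained on large image data",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)                     # documents x topics

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
print("dominant topic per document:", doc_topics.argmax(axis=1))
```
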

Languages

  • e 61
  • d 5

Types

  • a 60
  • el 8
  • s 2
  • m 1
  • r 1