Search (42 results, page 1 of 3)

Barbu, E.: What kind of knowledge is in Wikipedia? : unsupervised extraction of properties for similar concepts (2014) 0.06

0.06030063 = product of:
  0.12060126 = sum of:
    0.024705013 = weight(_text_:for in 1547) [ClassicSimilarity], result of:
      0.024705013 = score(doc=1547,freq=10.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.27831143 = fieldWeight in 1547, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=1547)
    0.095896244 = weight(_text_:computing in 1547) [ClassicSimilarity], result of:
      0.095896244 = score(doc=1547,freq=2.0), product of:
        0.26151994 = queryWeight, product of:
          5.5314693 = idf(docFreq=475, maxDocs=44218)
          0.047278564 = queryNorm
        0.36668807 = fieldWeight in 1547, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.5314693 = idf(docFreq=475, maxDocs=44218)
          0.046875 = fieldNorm(doc=1547)
  0.5 = coord(2/4)
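The tree above is a Lucene ClassicSimilarity (TF-IDF) explanation. As a minimal sketch, the Python below recomputes the score for doc 1547 from the values printed in the tree. The helper names `idf` and `term_score` are illustrative and not part of Lucene. Lucene works in 32-bit floats, so the last digits may differ slightly.

```python
import math

def idf(doc_freq, max_docs):
    # ClassicSimilarity: idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    tf = math.sqrt(freq)                        # tf = sqrt(termFreq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm        # queryWeight = idf * queryNorm
    field_weight = tf * term_idf * field_norm   # fieldWeight = tf * idf * fieldNorm
    return query_weight * field_weight

# Values copied from the explain tree for doc 1547.
MAX_DOCS = 44218
QUERY_NORM = 0.047278564
FIELD_NORM = 0.046875

s_for = term_score(10.0, 18385, MAX_DOCS, QUERY_NORM, FIELD_NORM)
s_computing = term_score(2.0, 475, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# coord(2/4): two of the four query clauses matched this document.
total = (s_for + s_computing) * (2 / 4)
print(round(total, 4))  # 0.0603
```

The rare term `computing` (docFreq=475) contributes roughly four times as much as the common term `for` (docFreq=18385), even though `for` occurs five times as often in the document.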

Abstract: This article presents a novel method for extracting knowledge from Wikipedia and a classification schema for annotating the extracted knowledge. Unlike the majority of approaches in the literature, we use the raw Wikipedia text for knowledge acquisition. The main assumption made is that the concepts classified under the same node in a taxonomy are described in a comparable way in Wikipedia. The annotation of the extracted knowledge is done at two levels: ontological and logical. The extracted properties are evaluated in the traditional way, that is, by computing the precision of the extraction procedure and in a clustering task. The second method of evaluation is seldom used in the natural language processing community, but it is regularly employed in cognitive psychology.
Source: Journal of the Association for Information Science and Technology. 65(2014) no.12, S.2489-2497

Chae, G.; Park, J.; Park, J.; Yeo, W.S.; Shi, C.: Linking and clustering artworks using social tags : revitalizing crowd-sourced information on cultural collections (2016) 0.05
```
0.04916378 = product of:
  0.09832756 = sum of:
    0.01841403 = weight(_text_:for in 2852) [ClassicSimilarity], result of:
      0.01841403 = score(doc=2852,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.20744109 = fieldWeight in 2852, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2852)
    0.079913534 = weight(_text_:computing in 2852) [ClassicSimilarity], result of:
      0.079913534 = score(doc=2852,freq=2.0), product of:
        0.26151994 = queryWeight, product of:
          5.5314693 = idf(docFreq=475, maxDocs=44218)
          0.047278564 = queryNorm
        0.3055734 = fieldWeight in 2852, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          5.5314693 = idf(docFreq=475, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2852)
  0.5 = coord(2/4)
```
Abstract

Social tagging is one of the most popular methods for collecting crowd-sourced information in galleries, libraries, archives, and museums (GLAMs). However, when the number of social tags grows rapidly, using them becomes problematic and, as a result, they are often left as simply big data that cannot be used for practical purposes. To revitalize the use of this crowd-sourced information, we propose using social tags to link and cluster artworks based on an experimental study using an online collection at the Gyeonggi Museum of Modern Art (GMoMA). We view social tagging as a folksonomy, where artworks are classified by keywords of the crowd's various interpretations and one artwork can belong to several different categories simultaneously. To leverage this strength of social tags, we used a clustering method called "link communities" to detect overlapping communities in a network of artworks constructed by computing similarities between all artwork pairs. We used this framework to identify semantic relationships and clusters of similar artworks. By comparing the clustering results with curators' manual classification results, we demonstrated the potential of social tagging data for automatically clustering artworks in a way that reflects the dynamic perspectives of crowds.

Source

Journal of the Association for Information Science and Technology. 67(2016) no.4, S.885-899
AlQenaei, Z.M.; Monarchi, D.E.: The use of learning techniques to analyze the results of a manual classification system (2016) 0.03
```
0.028253702 = product of:
  0.11301481 = sum of:
    0.11301481 = weight(_text_:computing in 2836) [ClassicSimilarity], result of:
      0.11301481 = score(doc=2836,freq=4.0), product of:
        0.26151994 = queryWeight, product of:
          5.5314693 = idf(docFreq=475, maxDocs=44218)
          0.047278564 = queryNorm
        0.43214604 = fieldWeight in 2836, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          5.5314693 = idf(docFreq=475, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2836)
  0.25 = coord(1/4)
```
Abstract

Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.

Object

Computing Classification System

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.02

0.020656807 = product of:
  0.041313615 = sum of:
    0.022096837 = weight(_text_:for in 2158) [ClassicSimilarity], result of:
      0.022096837 = score(doc=2158,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.24892932 = fieldWeight in 2158, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.019216778 = product of:
      0.038433556 = sum of:
        0.038433556 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
          0.038433556 = score(doc=2158,freq=2.0), product of:
            0.16556148 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.047278564 = queryNorm
            0.23214069 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
      0.5 = coord(1/2)
  0.5 = coord(2/4)
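In this tree the `22` clause sits inside a nested sub-query, so its score is scaled by an inner coord(1/2) before the outer coord(2/4) is applied. A rough sketch of that arithmetic, using only the values printed above (the helper name `term_score` is an assumption for illustration, not a Lucene function):

```python
import math

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # ClassicSimilarity: score = (idf * queryNorm) * (sqrt(freq) * idf * fieldNorm)
    term_idf = 1.0 + math.log(max_docs / (doc_freq + 1))
    return (term_idf * query_norm) * (math.sqrt(freq) * term_idf * field_norm)

# Values copied from the explain tree for doc 2158.
MAX_DOCS = 44218
QUERY_NORM = 0.047278564
FIELD_NORM = 0.046875

s_for = term_score(8.0, 18385, MAX_DOCS, QUERY_NORM, FIELD_NORM)
s_22 = term_score(2.0, 3622, MAX_DOCS, QUERY_NORM, FIELD_NORM)

inner = s_22 * (1 / 2)             # inner coord(1/2) on the nested clause
score = (s_for + inner) * (2 / 4)  # outer coord(2/4)
print(round(score, 4))  # 0.0207
```

The `22` match appears to come from the timestamp in the record's Date field (19:22:04) rather than from the abstract, which is why the nested clause adds score without adding topical relevance.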

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04
Source: Journal of the Association for Information Science and Technology. 66(2015) no.9, S.1817-1831

Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.02
```
0.019283235 = product of:
  0.03856647 = sum of:
    0.022552488 = weight(_text_:for in 1107) [ClassicSimilarity], result of:
      0.022552488 = score(doc=1107,freq=12.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.2540624 = fieldWeight in 1107, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.016013984 = product of:
      0.032027967 = sum of:
        0.032027967 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.032027967 = score(doc=1107,freq=2.0), product of:
            0.16556148 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.047278564 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(1/2)
  0.5 = coord(2/4)
```
Abstract

Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.

Date

28.10.2013 19:22:57

Source

Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2265-2277
Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.02
```
0.0174208 = product of:
  0.0348416 = sum of:
    0.015624823 = weight(_text_:for in 690) [ClassicSimilarity], result of:
      0.015624823 = score(doc=690,freq=4.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.17601961 = fieldWeight in 690, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.019216778 = product of:
      0.038433556 = sum of:
        0.038433556 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
          0.038433556 = score(doc=690,freq=2.0), product of:
            0.16556148 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.047278564 = queryNorm
            0.23214069 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.5 = coord(1/2)
  0.5 = coord(2/4)
```
Abstract

We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between latent semantic indexing (LSI) term subspace and LSI document subspace. LSISSM does feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.

Date

23. 3.2013 13:22:36

Source

Journal of the American Society for Information Science and Technology. 64(2013) no.4, S.844-860

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.01

0.008006992 = product of:
  0.032027967 = sum of:
    0.032027967 = product of:
      0.064055935 = sum of:
        0.064055935 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.064055935 = score(doc=2748,freq=2.0), product of:
            0.16556148 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.047278564 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.25 = coord(1/4)

Date: 1. 2.2016 18:25:22

HaCohen-Kerner, Y.; Beck, H.; Yehudai, E.; Rosenstein, M.; Mughaz, D.: Cuisine : classification using stylistic feature sets and/or name-based feature sets (2010) 0.01
```
0.0069052614 = product of:
  0.027621046 = sum of:
    0.027621046 = weight(_text_:for in 3706) [ClassicSimilarity], result of:
      0.027621046 = score(doc=3706,freq=18.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.31116164 = fieldWeight in 3706, product of:
          4.2426405 = tf(freq=18.0), with freq of:
            18.0 = termFreq=18.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3706)
  0.25 = coord(1/4)
```
Abstract

Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification UsIng Stylistic feature sets and/or NamE-based feature sets) achieves accuracy results between 90.71% and 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time&place). For the first six tasks, the stylistic feature sets in general and the quantitative feature set in particular are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds in significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.

Source

Journal of the American Society for Information Science and Technology. 61(2010) no.8, S.1644-1657
Golub, K.: Automated subject classification of textual documents in the context of Web-based hierarchical browsing (2011) 0.01
```
0.0067657465 = product of:
  0.027062986 = sum of:
    0.027062986 = weight(_text_:for in 4558) [ClassicSimilarity], result of:
      0.027062986 = score(doc=4558,freq=12.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.3048749 = fieldWeight in 4558, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
  0.25 = coord(1/4)
```
Abstract

While automated methods for information organization have been around for several decades now, exponential growth of the World Wide Web has put them into the forefront of research in different communities, within which several approaches can be identified: 1) machine learning (algorithms that allow computers to improve their performance based on learning from pre-existing data); 2) document clustering (algorithms for unsupervised document organization and automated topic extraction); and 3) string matching (algorithms that match given strings within larger text). Here the aim was to automatically organize textual documents into hierarchical structures for subject browsing. The string-matching approach was tested using a controlled vocabulary (containing pre-selected and pre-defined authorized terms, each corresponding to only one concept). The results imply that an appropriate controlled vocabulary, with a sufficient number of entry terms designating classes, could in itself be a solution for automated classification. Then, if the same controlled vocabulary had an appropriate hierarchical structure, it would at the same time provide a good browsing structure for the collection of automatically classified documents.
Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.01
```
0.006510343 = product of:
  0.026041372 = sum of:
    0.026041372 = weight(_text_:for in 4101) [ClassicSimilarity], result of:
      0.026041372 = score(doc=4101,freq=16.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.29336601 = fieldWeight in 4101, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4101)
  0.25 = coord(1/4)
```
Abstract

Hierarchical text classification (HTC) approaches have recently attracted a lot of interest on the part of researchers in human language technology and machine learning, since they have been shown to bring about equal, if not better, classification accuracy with respect to their "flat" counterparts while allowing exponential time savings at both learning and classification time. A typical component of HTC methods is a "local" policy for selecting negative examples: Given a category c, its negative training examples are by default identified with the training examples that are negative for c and positive for the categories which are siblings of c in the hierarchy. However, this policy has always been taken for granted and never been subjected to careful scrutiny since first proposed 15 years ago. This article proposes a thorough experimental comparison between this policy and three other policies for the selection of negative examples in HTC contexts, one of which (BEST LOCAL (k)) is being proposed for the first time in this article. We compare these policies on the hierarchical versions of three supervised learning algorithms (boosting, support vector machines, and naïve Bayes) by performing experiments on two standard TC datasets, REUTERS-21578 and RCV1-V2.

Source

Journal of the American Society for Information Science and Technology. 61(2010) no.11, S.2256-2265
Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.01
```
0.005638122 = product of:
  0.022552488 = sum of:
    0.022552488 = weight(_text_:for in 5041) [ClassicSimilarity], result of:
      0.022552488 = score(doc=5041,freq=12.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.2540624 = fieldWeight in 5041, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
  0.25 = coord(1/4)
```
Abstract

Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Social community question-answering websites, another rising trend, are the second choice for students who try to get answers from peers online. We attempt to discover possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval-based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.
Schaalje, G.B.; Blades, N.J.; Funai, T.: An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.01
```
0.0055242092 = product of:
  0.022096837 = sum of:
    0.022096837 = weight(_text_:for in 1041) [ClassicSimilarity], result of:
      0.022096837 = score(doc=1041,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.24892932 = fieldWeight in 1041, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=1041)
  0.25 = coord(1/4)
```
Abstract

Recent studies of authorship attribution have used machine-learning methods including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier to identify likely authors of disputed texts. These methods are all limited by an inability to perform open-set classification and account for text and corpus size. We propose a customized Bayesian logit-normal-beta-binomial classification model for supervised authorship attribution. The model is based on the beta-binomial distribution with an explicit inverse relationship between extra-binomial variation and text size. The model internally estimates the relationship of extra-binomial variation to text size, and uses Markov Chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine-learning methods as well as the open-set Bayesian classifier on undisputed papers of The Federalist, and testing the method on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. The Bayesian classifier was the best classifier of these texts.

Source

Journal of the American Society for Information Science and Technology. 64(2013) no.9, S.1815-1825
Sojka, P.; Lee, M.; Rehurek, R.; Hatlapatka, R.; Kucbel, M.; Bouche, T.; Goutorbe, C.; Anghelache, R.; Wojciechowski, K.: Toolset for entity and semantic associations : Final Release (2013) 0.01
```
0.0055242092 = product of:
  0.022096837 = sum of:
    0.022096837 = weight(_text_:for in 1057) [ClassicSimilarity], result of:
      0.022096837 = score(doc=1057,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.24892932 = fieldWeight in 1057, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=1057)
  0.25 = coord(1/4)
```
Abstract

In this document we describe the final release of the toolset for entity and semantic associations, integrating two versions (language dependent and language independent) of Unsupervised Document Similarity implemented by MU (using gensim tool) and Citation Indexing, Resolution and Matching (UJF/CMD). We give a brief description of tools, the rationale behind decisions made, and provide elementary evaluation. Tools are integrated in the main project result, EuDML website, and they deliver the needed functionality for exploratory searching and browsing the collected documents. EuDML users and content providers thus benefit from millions of algorithmically generated similarity and citation links, developed using state of the art machine learning and matching methods.

Content

See also: https://is.muni.cz/repo/1076213/en/Lee-Sojka-Rehurek-Bolikowski/Toolset-for-Entity-and-Semantic-Associations-Initial-Release-Deliverable-82-of-project-EuDML?lang=en.
Aphinyanaphongs, Y.; Fu, L.D.; Li, Z.; Peskin, E.R.; Efstathiadis, E.; Aliferis, C.F.; Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization (2014) 0.01
```
0.0055242092 = product of:
  0.022096837 = sum of:
    0.022096837 = weight(_text_:for in 1496) [ClassicSimilarity], result of:
      0.022096837 = score(doc=1496,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.24892932 = fieldWeight in 1496, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=1496)
  0.25 = coord(1/4)
```
Abstract

An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including adding recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.

Source

Journal of the Association for Information Science and Technology. 65(2014) no.10, S.1964-1987
Ko, Y.: A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.01
```
0.0055242092 = product of:
  0.022096837 = sum of:
    0.022096837 = weight(_text_:for in 2339) [ClassicSimilarity], result of:
      0.022096837 = score(doc=2339,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.24892932 = fieldWeight in 2339, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
  0.25 = coord(1/4)
```
Abstract

Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.

Source

Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2553-2565
Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.01
```
0.0055242092 = product of:
  0.022096837 = sum of:
    0.022096837 = weight(_text_:for in 3015) [ClassicSimilarity], result of:
      0.022096837 = score(doc=3015,freq=8.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.24892932 = fieldWeight in 3015, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=3015)
  0.25 = coord(1/4)
```
Abstract

We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question of whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.

Source

Journal of the Association for Information Science and Technology. 67(2016) no.7, S.1668-1678
Alberts, I.; Forest, D.: Email pragmatics and automatic classification : a study in the organizational context (2012) 0.01
```
0.0051468783 = product of:
  0.020587513 = sum of:
    0.020587513 = weight(_text_:for in 238) [ClassicSimilarity], result of:
      0.020587513 = score(doc=238,freq=10.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.2319262 = fieldWeight in 238, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=238)
  0.25 = coord(1/4)
```
Abstract

This paper presents a two-phased research project aiming to improve email triage for public administration managers. The first phase developed a typology of email classification patterns through a qualitative study involving 34 participants. Inspired by the fields of pragmatics and speech act theory, this typology comprising four top level categories and 13 subcategories represents the typical email triage behaviors of managers in an organizational context. The second study phase was conducted on a corpus of 1,703 messages using email samples of two managers. Using the k-NN (k-nearest neighbor) algorithm, statistical treatments automatically classified the email according to lexical and nonlexical features representative of managers' triage patterns. The automatic classification of email according to the lexicon of the messages was found to be substantially more efficient when k = 2 and n = 2,000. For four categories, the average recall rate was 94.32%, the average precision rate was 94.50%, and the accuracy rate was 94.54%. For 13 categories, the average recall rate was 91.09%, the average precision rate was 84.18%, and the accuracy rate was 88.70%. It appears that a message's nonlexical features are also deeply influenced by email pragmatics. Features related to the recipient and the sender were the most relevant for characterizing email.

Source

Journal of the American Society for Information Science and Technology. 63(2012) no.5, S.904-922
Ru, C.; Tang, J.; Li, S.; Xie, S.; Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction (2018) 0.01
```
0.0051468783 = product of:
  0.020587513 = sum of:
    0.020587513 = weight(_text_:for in 5055) [ClassicSimilarity], result of:
      0.020587513 = score(doc=5055,freq=10.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.2319262 = fieldWeight in 5055, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
  0.25 = coord(1/4)
```
Abstract

Distant supervision (DS) has the advantage of automatically generating large amounts of labelled training data and has been widely used for relation extraction. However, there are usually many wrong labels in the automatically labelled data in distant supervision (Riedel, Yao, & McCallum, 2010). This paper presents a novel method to reduce the wrong labels. The proposed method uses the semantic Jaccard with word embedding to measure the semantic similarity between the relation phrase in the knowledge base and the dependency phrases between two entities in a sentence to filter the wrong labels. In the process of reducing wrong labels, the semantic Jaccard algorithm selects a core dependency phrase to represent the candidate relation in a sentence, which can capture features for relation classification and avoid the negative impact from irrelevant term sequences that previous neural network models of relation extraction often suffer from. In the process of relation classification, the core dependency phrases are also used as the input of a convolutional neural network (CNN) for relation classification. The experimental results show that compared with the methods using original DS data, the methods using filtered DS data performed much better in relation extraction. This indicates that the semantic-similarity-based method is effective in reducing wrong labels. The relation extraction performance of the CNN model using the core dependency phrases as input is the best of all, which indicates that using the core dependency phrases as input of CNN is enough to capture the features for relation classification and could avoid negative impact from irrelevant terms.
Liu, R.-L.: Context-based term frequency assessment for text classification (2010) 0.00
```
0.004784106 = product of:
  0.019136423 = sum of:
    0.019136423 = weight(_text_:for in 3331) [ClassicSimilarity], result of:
      0.019136423 = score(doc=3331,freq=6.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.21557912 = fieldWeight in 3331, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=3331)
  0.25 = coord(1/4)
```
Abstract

Automatic text classification (TC) is essential for the management of information. To properly classify a document d, it is essential to identify the semantics of each term t in d, while the semantics heavily depend on context (neighboring terms) of t in d. Therefore, we present a technique CTFA (Context-based Term Frequency Assessment) that improves text classifiers by considering term contexts in test documents. The results of the term context recognition are used to assess term frequencies of terms, and hence CTFA may easily work with various kinds of text classifiers that base their TC decisions on term frequencies, without needing to modify the classifiers. Moreover, CTFA is efficient, and neither huge memory nor domain-specific knowledge is required. Empirical results show that CTFA successfully enhances performance of several kinds of text classifiers on different experimental data.

Source

Journal of the American Society for Information Science and Technology. 61(2010) no.2, S.300-309
Cortez, E.; Herrera, M.R.; Silva, A.S. da; Moura, E.S. de; Neubert, M.: Lightweight methods for large-scale product categorization (2011) 0.00
```
0.004784106 = product of:
  0.019136423 = sum of:
    0.019136423 = weight(_text_:for in 4758) [ClassicSimilarity], result of:
      0.019136423 = score(doc=4758,freq=6.0), product of:
        0.08876751 = queryWeight, product of:
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.047278564 = queryNorm
        0.21557912 = fieldWeight in 4758, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.8775425 = idf(docFreq=18385, maxDocs=44218)
          0.046875 = fieldNorm(doc=4758)
  0.25 = coord(1/4)
```
Abstract

In this article, we present a study about classification methods for large-scale categorization of product offers on e-shopping web sites. We present a study about the performance of previously proposed approaches and deployed a probabilistic approach to model the classification problem. We also studied an alternative way of modeling information about the description of product offers and investigated the usage of price and store of product offers as features adopted in the classification process. Our experiments used two collections of over a million product offers previously categorized by human editors and taxonomies of hundreds of categories from a real e-shopping web site. In these experiments, our method achieved an improvement of up to 9% in the quality of the categorization in comparison with the best baseline we have found.

Source

Journal of the American Society for Information Science and Technology. 62(2011) no.9, S.1839-1848
