Search (44 results, page 1 of 3)

Golub, K.: Automated subject classification of textual documents in the context of Web-based hierarchical browsing (2011) 0.02

0.015901545 = product of:
  0.07420721 = sum of:
    0.03856498 = weight(_text_:wide in 4558) [ClassicSimilarity], result of:
      0.03856498 = score(doc=4558,freq=2.0), product of:
        0.1312982 = queryWeight, product of:
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.029633347 = queryNorm
        0.29372054 = fieldWeight in 4558, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
    0.029588435 = weight(_text_:web in 4558) [ClassicSimilarity], result of:
      0.029588435 = score(doc=4558,freq=4.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.3059541 = fieldWeight in 4558, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
    0.0060537956 = weight(_text_:information in 4558) [ClassicSimilarity], result of:
      0.0060537956 = score(doc=4558,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.116372846 = fieldWeight in 4558, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
  0.21428572 = coord(3/14)

Abstract: While automated methods for information organization have been around for several decades now, exponential growth of the World Wide Web has put them into the forefront of research in different communities, within which several approaches can be identified: 1) machine learning (algorithms that allow computers to improve their performance based on learning from pre-existing data); 2) document clustering (algorithms for unsupervised document organization and automated topic extraction); and 3) string matching (algorithms that match given strings within larger text). Here the aim was to automatically organize textual documents into hierarchical structures for subject browsing. The string-matching approach was tested using a controlled vocabulary (containing pre-selected and pre-defined authorized terms, each corresponding to only one concept). The results imply that an appropriate controlled vocabulary, with a sufficient number of entry terms designating classes, could in itself be a solution for automated classification. Then, if the same controlled vocabulary had an appropriat hierarchical structure, it would at the same time provide a good browsing structure for the collection of automatically classified documents.

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.02

0.015698675 = product of:
  0.073260486 = sum of:
    0.05917687 = weight(_text_:web in 2158) [ClassicSimilarity], result of:
      0.05917687 = score(doc=2158,freq=16.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.6119082 = fieldWeight in 2158, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.0060537956 = weight(_text_:information in 2158) [ClassicSimilarity], result of:
      0.0060537956 = score(doc=2158,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.116372846 = fieldWeight in 2158, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.008029819 = product of:
      0.024089456 = sum of:
        0.024089456 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
          0.024089456 = score(doc=2158,freq=2.0), product of:
            0.103770934 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029633347 = queryNorm
            0.23214069 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
      0.33333334 = coord(1/3)
  0.21428572 = coord(3/14)

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04
Source: Journal of the Association for Information Science and Technology. 66(2015) no.9, S.1817-1831

Groß, T.; Faden, M.: Automatische Indexierung elektronischer Dokumente an der Deutschen Zentralbibliothek für Wirtschaftswissenschaften : Bericht über die Jahrestagung der Internationalen Buchwissenschaftlichen Gesellschaft (2010) 0.01
```
0.013228344 = product of:
  0.06173227 = sum of:
    0.025709987 = weight(_text_:wide in 4051) [ClassicSimilarity], result of:
      0.025709987 = score(doc=4051,freq=2.0), product of:
        0.1312982 = queryWeight, product of:
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.029633347 = queryNorm
        0.1958137 = fieldWeight in 4051, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.03125 = fieldNorm(doc=4051)
    0.013948122 = weight(_text_:web in 4051) [ClassicSimilarity], result of:
      0.013948122 = score(doc=4051,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.14422815 = fieldWeight in 4051, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.03125 = fieldNorm(doc=4051)
    0.022074163 = weight(_text_:bibliothek in 4051) [ClassicSimilarity], result of:
      0.022074163 = score(doc=4051,freq=2.0), product of:
        0.121660605 = queryWeight, product of:
          4.1055303 = idf(docFreq=1980, maxDocs=44218)
          0.029633347 = queryNorm
        0.18144052 = fieldWeight in 4051, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.1055303 = idf(docFreq=1980, maxDocs=44218)
          0.03125 = fieldNorm(doc=4051)
  0.21428572 = coord(3/14)
```
Abstract

Die zunehmende Verfügbarmachung digitaler Informationen in den letzten Jahren sowie die Aussicht auf ein weiteres Ansteigen der sogenannten Datenflut kumulieren in einem grundlegenden, sich weiter verstärkenden Informationsstrukturierungsproblem. Die stetige Zunahme von digitalen Informationsressourcen im World Wide Web sichert zwar jederzeit und ortsungebunden den Zugriff auf verschiedene Informationen; offen bleibt der strukturierte Zugang, insbesondere zu wissenschaftlichen Ressourcen. Angesichts der steigenden Anzahl elektronischer Inhalte und vor dem Hintergrund stagnierender bzw. knapper werdender personeller Ressourcen in der Sacherschließun schafft keine Bibliothek bzw. kein Bibliotheksverbund es mehr, weder aktuell noch zukünftig, alle digitalen Daten zu erfassen, zu strukturieren und zueinander in Beziehung zu setzen. In der Informationsgesellschaft des 21. Jahrhunderts wird es aber zunehmend wichtiger, die in der Flut verschwundenen wissenschaftlichen Informationen zeitnah, angemessen und vollständig zu strukturieren und somit als Basis für eine Wissensgenerierung wieder nutzbar zu machen. Eine normierte Inhaltserschließung digitaler Informationsressourcen ist deshalb für die Deutsche Zentralbibliothek für Wirtschaftswissenschaften (ZBW) als wichtige Informationsinfrastruktureinrichtung in diesem Bereich ein entscheidender und auch erfolgskritischer Aspekt im Wettbewerb mit anderen Informationsdienstleistern. Weil die traditionelle intellektuelle Sacherschließung aber nicht beliebig skalierbar ist - mit dem Anstieg der Zahl an Online-Dokumenten steigt proportional auch der personelle Ressourcenbedarf an Fachreferenten, wenn ein gewisser Qualitätsstandard gehalten werden soll - bedarf es zukünftig anderer Sacherschließungsverfahren. Automatisierte Verschlagwortungsmethoden werden dabei als einzige Möglichkeit angesehen, die bibliothekarische Sacherschließung auch im digitalen Zeitalter zukunftsfest auszugestalten. Zudem können maschinelle Ansätze dazu beitragen, die Heterogenitäten (Indexierungsinkonsistenzen) zwischen den einzelnen Sacherschließer zu nivellieren, und somit zu einer homogeneren Erschließung des Bibliotheksbestandes beitragen.
Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.01
```
0.011209717 = product of:
  0.052312013 = sum of:
    0.03019857 = weight(_text_:web in 5041) [ClassicSimilarity], result of:
      0.03019857 = score(doc=5041,freq=6.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.3122631 = fieldWeight in 5041, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
    0.0071344664 = weight(_text_:information in 5041) [ClassicSimilarity], result of:
      0.0071344664 = score(doc=5041,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.13714671 = fieldWeight in 5041, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
    0.014978974 = weight(_text_:retrieval in 5041) [ClassicSimilarity], result of:
      0.014978974 = score(doc=5041,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 5041, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
  0.21428572 = coord(3/14)
```
Abstract

Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend; social community question answering websites are the second choice for students who try to get answers from other peers online. We attempt discovering possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.

Source

Information processing and management. 56(2019) no.1, S.228-246
Smiraglia, R.P.; Cai, X.: Tracking the evolution of clustering, machine learning, automatic indexing and automatic classification in knowledge organization (2017) 0.01
```
0.008026919 = product of:
  0.037458956 = sum of:
    0.017435152 = weight(_text_:web in 3627) [ClassicSimilarity], result of:
      0.017435152 = score(doc=3627,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.18028519 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
    0.0050448296 = weight(_text_:information in 3627) [ClassicSimilarity], result of:
      0.0050448296 = score(doc=3627,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.09697737 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
    0.014978974 = weight(_text_:retrieval in 3627) [ClassicSimilarity], result of:
      0.014978974 = score(doc=3627,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
  0.21428572 = coord(3/14)
```
Abstract

A very important extension of the traditional domain of knowledge organization (KO) arises from attempts to incorporate techniques devised in the computer science domain for automatic concept extraction and for grouping, categorizing, clustering and otherwise organizing knowledge using mechanical means. Four specific terms have emerged to identify the most prevalent techniques: machine learning, clustering, automatic indexing, and automatic classification. Our study presents three domain analytical case analyses in search of answers. The first case relies on citations located using the ISKO-supported "Knowledge Organization Bibliography." The second case relies on works in both Web of Science and SCOPUS. Case three applies co-word analysis and citation analysis to the contents of the papers in the present special issue. We observe scholars involved in "clustering" and "automatic classification" who share common thematic emphases. But we have found no coherence, no common activity and no social semantics. We have not found a research front, or a common teleology within the KO domain. We also have found a lively group of authors who have succeeded in submitting papers to this special issue, and their work quite interestingly aligns with the case studies we report. There is an emphasis on KO for information retrieval; there is much work on clustering (which involves conceptual points within texts) and automatic classification (which involves semantic groupings at the meta-document level).
Ko, Y.: ¬A new term-weighting scheme for text classification using the odds of positive and negative class probabilities (2015) 0.01
```
0.007581752 = product of:
  0.05307226 = sum of:
    0.01712272 = weight(_text_:information in 2339) [ClassicSimilarity], result of:
      0.01712272 = score(doc=2339,freq=16.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.3291521 = fieldWeight in 2339, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
    0.03594954 = weight(_text_:retrieval in 2339) [ClassicSimilarity], result of:
      0.03594954 = score(doc=2339,freq=8.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.40105087 = fieldWeight in 2339, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.046875 = fieldNorm(doc=2339)
  0.14285715 = coord(2/14)
```
Abstract

Text classification (TC) is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term-weighting schemes assign an appropriate weight to each term to obtain a high TC performance. Although term weighting is one of the important modules for TC and TC has different peculiarities from those in information retrieval, many term-weighting schemes used in information retrieval, such as term frequency-inverse document frequency (tf-idf), have been used in TC in the same manner. The peculiarity of TC that differs most from information retrieval is the existence of class information. This article proposes a new term-weighting scheme that uses class information using positive and negative class distributions. As a result, the proposed scheme, log tf-TRR, consistently performs better than do other schemes using class information as well as traditional schemes such as tf-idf.

Source

Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2553-2565
Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.01
```
0.007503826 = product of:
  0.035017855 = sum of:
    0.013347364 = weight(_text_:information in 1107) [ClassicSimilarity], result of:
      0.013347364 = score(doc=1107,freq=14.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.256578 = fieldWeight in 1107, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.014978974 = weight(_text_:retrieval in 1107) [ClassicSimilarity], result of:
      0.014978974 = score(doc=1107,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.0066915164 = product of:
      0.020074548 = sum of:
        0.020074548 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.020074548 = score(doc=1107,freq=2.0), product of:
            0.103770934 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029633347 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.33333334 = coord(1/3)
  0.21428572 = coord(3/14)
```
Abstract

Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.

Date

28.10.2013 19:22:57

Source

Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2265-2277
Fang, H.: Classifying research articles in multidisciplinary sciences journals into subject categories (2015) 0.01
```
0.005702162 = product of:
  0.039915133 = sum of:
    0.034870304 = weight(_text_:web in 2194) [ClassicSimilarity], result of:
      0.034870304 = score(doc=2194,freq=8.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.36057037 = fieldWeight in 2194, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2194)
    0.0050448296 = weight(_text_:information in 2194) [ClassicSimilarity], result of:
      0.0050448296 = score(doc=2194,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.09697737 = fieldWeight in 2194, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2194)
  0.14285715 = coord(2/14)
```
Abstract

In the Thomson Reuters Web of Science database, the subject categories of a journal are applied to all articles in the journal. However, many articles in multidisciplinary Sciences journals may only be represented by a small number of subject categories. To provide more accurate information on the research areas of articles in such journals, we can classify articles in these journals into subject categories as defined by Web of Science based on their references. For an article in a multidisciplinary sciences journal, the method counts the subject categories in all of the article's references indexed by Web of Science, and uses the most numerous subject categories of the references to determine the most appropriate classification of the article. We used articles in an issue of Proceedings of the National Academy of Sciences (PNAS) to validate the correctness of the method by comparing the obtained results with the categories of the articles as defined by PNAS and their content. This study shows that the method provides more precise search results for the subject category of interest in bibliometric investigations through recognition of articles in multidisciplinary sciences journals whose work relates to a particular subject category.

Object

Web of science
Cortez, E.; Herrera, M.R.; Silva, A.S. da; Moura, E.S. de; Neubert, M.: Lightweight methods for large-scale product categorization (2011) 0.01
```
0.005449971 = product of:
  0.038149796 = sum of:
    0.029588435 = weight(_text_:web in 4758) [ClassicSimilarity], result of:
      0.029588435 = score(doc=4758,freq=4.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.3059541 = fieldWeight in 4758, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=4758)
    0.00856136 = weight(_text_:information in 4758) [ClassicSimilarity], result of:
      0.00856136 = score(doc=4758,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16457605 = fieldWeight in 4758, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=4758)
  0.14285715 = coord(2/14)
```
Abstract

In this article, we present a study about classification methods for large-scale categorization of product offers on e-shopping web sites. We present a study about the performance of previously proposed approaches and deployed a probabilistic approach to model the classification problem. We also studied an alternative way of modeling information about the description of product offers and investigated the usage of price and store of product offers as features adopted in the classification process. Our experiments used two collections of over a million product offers previously categorized by human editors and taxonomies of hundreds of categories from a real e-shopping web site. In these experiments, our method achieved an improvement of up to 9% in the quality of the categorization in comparison with the best baseline we have found.

Source

Journal of the American Society for Information Science and Technology. 62(2011) no.9, S.1839-1848
Golub, K.; Soergel, D.; Buchanan, G.; Tudhope, D.; Lykke, M.; Hiom, D.: ¬A framework for evaluating automatic indexing or classification in the context of retrieval (2016) 0.00
```
0.004954607 = product of:
  0.034682248 = sum of:
    0.008737902 = weight(_text_:information in 3311) [ClassicSimilarity], result of:
      0.008737902 = score(doc=3311,freq=6.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16796975 = fieldWeight in 3311, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
    0.025944345 = weight(_text_:retrieval in 3311) [ClassicSimilarity], result of:
      0.025944345 = score(doc=3311,freq=6.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.28943354 = fieldWeight in 3311, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
  0.14285715 = coord(2/14)
```
Abstract

Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. Although some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The article reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.

Series

Advances in information science

Source

Journal of the Association for Information Science and Technology. 67(2016) no.1, S.3-16
AlQenaei, Z.M.; Monarchi, D.E.: ¬The use of learning techniques to analyze the results of a manual classification system (2016) 0.00
```
0.004274482 = product of:
  0.029921371 = sum of:
    0.008737902 = weight(_text_:information in 2836) [ClassicSimilarity], result of:
      0.008737902 = score(doc=2836,freq=6.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16796975 = fieldWeight in 2836, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2836)
    0.021183468 = weight(_text_:retrieval in 2836) [ClassicSimilarity], result of:
      0.021183468 = score(doc=2836,freq=4.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.23632148 = fieldWeight in 2836, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2836)
  0.14285715 = coord(2/14)
```
Abstract

Classification is the process of assigning objects to pre-defined classes based on observations or characteristics of those objects, and there are many approaches to performing this task. The overall objective of this study is to demonstrate the use of two learning techniques to analyze the results of a manual classification system. Our sample consisted of 1,026 documents, from the ACM Computing Classification System, classified by their authors as belonging to one of the groups of the classification system: "H.3 Information Storage and Retrieval." A singular value decomposition of the documents' weighted term-frequency matrix was used to represent each document in a 50-dimensional vector space. The analysis of the representation using both supervised (decision tree) and unsupervised (clustering) techniques suggests that two pairs of the ACM classes are closely related to each other in the vector space. Class 1 (Content Analysis and Indexing) is closely related to Class 3 (Information Search and Retrieval), and Class 4 (Systems and Software) is closely related to Class 5 (Online Information Services). Further analysis was performed to test the diffusion of the words in the two classes using both cosine and Euclidean distance.
Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.00
```
0.0042119347 = product of:
  0.029483542 = sum of:
    0.020922182 = weight(_text_:web in 3464) [ClassicSimilarity], result of:
      0.020922182 = score(doc=3464,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.21634221 = fieldWeight in 3464, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=3464)
    0.00856136 = weight(_text_:information in 3464) [ClassicSimilarity], result of:
      0.00856136 = score(doc=3464,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.16457605 = fieldWeight in 3464, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=3464)
  0.14285715 = coord(2/14)
```
Abstract

We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and they are employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods, and they were cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.

Source

Journal of the American Society for Information Science and Technology. 61(2010) no.6, S.1105-1119
Barthel, S.; Tönnies, S.; Balke, W.-T.: Large-scale experiments for mathematical document classification (2013) 0.00
```
0.0035099457 = product of:
  0.02456962 = sum of:
    0.017435152 = weight(_text_:web in 1056) [ClassicSimilarity], result of:
      0.017435152 = score(doc=1056,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.18028519 = fieldWeight in 1056, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1056)
    0.0071344664 = weight(_text_:information in 1056) [ClassicSimilarity], result of:
      0.0071344664 = score(doc=1056,freq=4.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.13714671 = fieldWeight in 1056, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1056)
  0.14285715 = coord(2/14)
```
Abstract

The ever increasing amount of digitally available information is curse and blessing at the same time. On the one hand, users have increasingly large amounts of information at their fingertips. On the other hand, the assessment and refinement of web search results becomes more and more tiresome and difficult for non-experts in a domain. Therefore, established digital libraries offer specialized collections with a certain degree of quality. This quality can largely be attributed to the great effort invested into semantic enrichment of the provided documents e.g. by annotating their documents with respect to a domain-specific taxonomy. This process is still done manually in many domains, e.g. chemistry CAS, medicine MeSH, or mathematics MSC. But due to the growing amount of data, this manual task gets more and more time consuming and expensive. The only solution for this problem seems to employ automated classification algorithms, but from evaluations done in previous research, conclusions to a real world scenario are difficult to make. We therefore conducted a large scale feasibility study on a real world data set from one of the biggest mathematical digital libraries, i.e. Zentralblatt MATH, with special focus on its practical applicability.

Kasprzik, A.: Automatisierte und semiautomatisierte Klassifizierung : eine Analyse aktueller Projekte (2014) 0.00

0.0033447412 = product of:
  0.046826374 = sum of:
    0.046826374 = weight(_text_:bibliothek in 2470) [ClassicSimilarity], result of:
      0.046826374 = score(doc=2470,freq=4.0), product of:
        0.121660605 = queryWeight, product of:
          4.1055303 = idf(docFreq=1980, maxDocs=44218)
          0.029633347 = queryNorm
        0.38489348 = fieldWeight in 2470, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.1055303 = idf(docFreq=1980, maxDocs=44218)
          0.046875 = fieldNorm(doc=2470)
  0.071428575 = coord(1/14)

Content: Vgl.: https://journals.ub.uni-heidelberg.de/index.php/bibliothek/article/view/14022.
Source: Perspektive Bibliothek. 3(2014) H.1, S.85-110

Borodin, Y.; Polishchuk, V.; Mahmud, J.; Ramakrishnan, I.V.; Stent, A.: Live and learn from mistakes : a lightweight system for document classification (2013) 0.00
```
0.003211426 = product of:
  0.022479981 = sum of:
    0.017435152 = weight(_text_:web in 2722) [ClassicSimilarity], result of:
      0.017435152 = score(doc=2722,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.18028519 = fieldWeight in 2722, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2722)
    0.0050448296 = weight(_text_:information in 2722) [ClassicSimilarity], result of:
      0.0050448296 = score(doc=2722,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.09697737 = fieldWeight in 2722, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2722)
  0.14285715 = coord(2/14)
```
Abstract

We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.

Source

Information processing and management. 49(2013) no.1, S.83-98

Ru, C.; Tang, J.; Li, S.; Xie, S.; Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction (2018) 0.00

0.0028605436 = product of:
  0.020023804 = sum of:
    0.0050448296 = weight(_text_:information in 5055) [ClassicSimilarity], result of:
      0.0050448296 = score(doc=5055,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.09697737 = fieldWeight in 5055, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
    0.014978974 = weight(_text_:retrieval in 5055) [ClassicSimilarity], result of:
      0.014978974 = score(doc=5055,freq=2.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.16710453 = fieldWeight in 5055, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
  0.14285715 = coord(2/14)

Source: Information processing and management. 54(2018) no.4, S.593-608
Theme: Semantisches Umfeld in Indexierung u. Retrieval

Piros, A.: Automatic interpretation of complex UDC numbers : towards support for library systems (2015) 0.00
```
0.0025691409 = product of:
  0.017983986 = sum of:
    0.013948122 = weight(_text_:web in 2301) [ClassicSimilarity], result of:
      0.013948122 = score(doc=2301,freq=2.0), product of:
        0.09670874 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029633347 = queryNorm
        0.14422815 = fieldWeight in 2301, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.03125 = fieldNorm(doc=2301)
    0.0040358636 = weight(_text_:information in 2301) [ClassicSimilarity], result of:
      0.0040358636 = score(doc=2301,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.0775819 = fieldWeight in 2301, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.03125 = fieldNorm(doc=2301)
  0.14285715 = coord(2/14)
```
Abstract

Analytico-synthetic and faceted classifications, such as Universal Decimal Classification (UDC) express content of documents with complex, pre-combined classification codes. Without classification authority control that would help manage and access structured notations, the use of UDC codes in searching and browsing is limited. Existing UDC parsing solutions are usually created for a particular database system or a specific task and are not widely applicable. The approach described in this paper provides a solution by which the analysis and interpretation of UDC notations would be stored into an intermediate format (in this case, in XML) by automatic means without any data or information loss. Due to its richness, the output file can be converted into different formats, such as standard mark-up and data exchange formats or simple lists of the recommended entry points of a UDC number. The program can also be used to create authority records containing complex UDC numbers which can be comprehensively analysed in order to be retrieved effectively. The Java program, as well as the corresponding schema definition it employs, is under continuous development. The current version of the interpreter software is now available online for testing purposes at the following web site: http://interpreter-eto.rhcloud.com. The future plan is to implement conversion methods for standard formats and to create standard online interfaces in order to make it possible to use the features of software as a service. This would result in the algorithm being able to be employed both in existing and future library systems to analyse UDC numbers without any significant programming effort.

Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.00

0.002011945 = product of:
  0.014083615 = sum of:
    0.0060537956 = weight(_text_:information in 690) [ClassicSimilarity], result of:
      0.0060537956 = score(doc=690,freq=2.0), product of:
        0.052020688 = queryWeight, product of:
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.029633347 = queryNorm
        0.116372846 = fieldWeight in 690, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.7554779 = idf(docFreq=20772, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.008029819 = product of:
      0.024089456 = sum of:
        0.024089456 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
          0.024089456 = score(doc=690,freq=2.0), product of:
            0.103770934 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029633347 = queryNorm
            0.23214069 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.33333334 = coord(1/3)
  0.14285715 = coord(2/14)

Date: 23. 3.2013 13:22:36
Source: Journal of the American Society for Information Science and Technology. 64(2013) no.4, S.844-860

Golub, K.; Hansson, J.; Soergel, D.; Tudhope, D.: Managing classification in libraries : a methodological outline for evaluating automatic subject indexing and classification in Swedish library catalogues (2015) 0.00
```
0.001513105 = product of:
  0.021183468 = sum of:
    0.021183468 = weight(_text_:retrieval in 2300) [ClassicSimilarity], result of:
      0.021183468 = score(doc=2300,freq=4.0), product of:
        0.08963835 = queryWeight, product of:
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.029633347 = queryNorm
        0.23632148 = fieldWeight in 2300, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.024915 = idf(docFreq=5836, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
  0.071428575 = coord(1/14)
```
Abstract

Subject terms play a crucial role in resource discovery but require substantial effort to produce. Automatic subject classification and indexing address problems of scale and sustainability and can be used to enrich existing bibliographic records, establish more connections across and between resources and enhance consistency of bibliographic data. The paper aims to put forward a complex methodological framework to evaluate automatic classification tools of Swedish textual documents based on the Dewey Decimal Classification (DDC) recently introduced to Swedish libraries. Three major complementary approaches are suggested: a quality-built gold standard, retrieval effects, domain analysis. The gold standard is built based on input from at least two catalogue librarians, end-users expert in the subject, end users inexperienced in the subject and automated tools. Retrieval effects are studied through a combination of assigned and free tasks, including factual and comprehensive types. The study also takes into consideration the different role and character of subject terms in various knowledge domains, such as scientific disciplines. As a theoretical framework, domain analysis is used and applied in relation to the implementation of DDC in Swedish libraries and chosen domains of knowledge within the DDC itself.

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.00

9.5593097E-4 = product of:
  0.013383033 = sum of:
    0.013383033 = product of:
      0.040149096 = sum of:
        0.040149096 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.040149096 = score(doc=2748,freq=2.0), product of:
            0.103770934 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029633347 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.33333334 = coord(1/3)
  0.071428575 = coord(1/14)

Date: 1. 2.2016 18:25:22

Search (44 results, page 1 of 3)

Authors

Languages

Types

Themes