Search (61 results, page 1 of 4)

Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.23

0.22834091 = product of:
  0.45668182 = sum of:
    0.06293926 = product of:
      0.18881777 = sum of:
        0.18881777 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
          0.18881777 = score(doc=562,freq=2.0), product of:
            0.3359639 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03962768 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.33333334 = coord(1/3)
    0.18881777 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.18881777 = score(doc=562,freq=2.0), product of:
        0.3359639 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.03962768 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.18881777 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
      0.18881777 = score(doc=562,freq=2.0), product of:
        0.3359639 = queryWeight, product of:
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.03962768 = queryNorm
        0.56201804 = fieldWeight in 562, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.478011 = idf(docFreq=24, maxDocs=44218)
          0.046875 = fieldNorm(doc=562)
    0.01610701 = product of:
      0.03221402 = sum of:
        0.03221402 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
          0.03221402 = score(doc=562,freq=2.0), product of:
            0.13876937 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03962768 = queryNorm
            0.23214069 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
      0.5 = coord(1/2)
  0.5 = coord(4/8)
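
The per-term weights in the tree above can be reproduced by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF definitions (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of any Lucene API, and the input values are copied from the explain output for doc 562.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency as assumed for ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_term_score(freq: float, doc_freq: int, max_docs: int,
                       query_norm: float, field_norm: float) -> float:
    """Score of one term clause: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                   # tf(freq=2.0) = 1.4142135
    idf = classic_idf(doc_freq, max_docs)  # idf(docFreq=24, maxDocs=44218)
    query_weight = idf * query_norm        # "queryWeight" line
    field_weight = tf * idf * field_norm   # "fieldWeight" line
    return query_weight * field_weight


# Values taken from the weight(_text_:3a in 562) subtree.
score = classic_term_score(freq=2.0, doc_freq=24, max_docs=44218,
                           query_norm=0.03962768, field_norm=0.046875)
print(round(score, 6))
```

The result should agree with the listed 0.18881777 up to single-precision rounding, since Lucene computes these values with 32-bit floats.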

Content: Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
Date: 8. 1.2013 10:22:32

Barthel, S.; Tönnies, S.; Balke, W.-T.: Large-scale experiments for mathematical document classification (2013) 0.05
```
0.04577707 = product of:
  0.18310829 = sum of:
    0.040918473 = weight(_text_:libraries in 1056) [ClassicSimilarity], result of:
      0.040918473 = score(doc=1056,freq=6.0), product of:
        0.13017908 = queryWeight, product of:
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.03962768 = queryNorm
        0.3143245 = fieldWeight in 1056, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1056)
    0.14218982 = weight(_text_:pacific in 1056) [ClassicSimilarity], result of:
      0.14218982 = score(doc=1056,freq=2.0), product of:
        0.3193714 = queryWeight, product of:
          8.059301 = idf(docFreq=37, maxDocs=44218)
          0.03962768 = queryNorm
        0.44521773 = fieldWeight in 1056, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          8.059301 = idf(docFreq=37, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1056)
  0.25 = coord(2/8)
```
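
The top-level lines of an explain tree like the one above combine the matching clause scores with a coordination factor. Here is a minimal sketch, assuming the classic behaviour in which the clause scores are summed and multiplied by coord = matched clauses / total query clauses; `combine_with_coord` is a hypothetical helper name, and the inputs are the two clause scores listed for doc 1056.

```python
def combine_with_coord(clause_scores: list[float], total_clauses: int) -> float:
    """Sum the scores of the matching clauses and scale by coord(matched/total)."""
    coord = len(clause_scores) / total_clauses  # e.g. coord(2/8) = 0.25
    return sum(clause_scores) * coord


# weight(_text_:libraries) and weight(_text_:pacific) for doc 1056,
# with 8 clauses in the (assumed) top-level boolean query.
doc_1056 = combine_with_coord([0.040918473, 0.14218982], total_clauses=8)
print(round(doc_1056, 6))
```

This also explains why a record that matches few of the eight clauses ranks low even when one rare term (here "pacific") contributes a large individual weight.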
Abstract

The ever increasing amount of digitally available information is a curse and a blessing at the same time. On the one hand, users have increasingly large amounts of information at their fingertips. On the other hand, the assessment and refinement of web search results become more and more tiresome and difficult for non-experts in a domain. Therefore, established digital libraries offer specialized collections with a certain degree of quality. This quality can largely be attributed to the great effort invested into semantic enrichment of the provided documents, e.g. by annotating them with respect to a domain-specific taxonomy. This process is still done manually in many domains, e.g. chemistry (CAS), medicine (MeSH), or mathematics (MSC). But due to the growing amount of data, this manual task gets more and more time-consuming and expensive. The only solution to this problem seems to be to employ automated classification algorithms, but it is difficult to draw conclusions for a real-world scenario from the evaluations done in previous research. We therefore conducted a large-scale feasibility study on a real-world data set from one of the biggest mathematical digital libraries, i.e. Zentralblatt MATH, with special focus on its practical applicability.

Source

15th International Conference on Asia-Pacific Digital Libraries ICADL 2013. Bangalore, India. [to appear, 2013]

Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C.: ¬A comparative study of two automatic document classification methods in a library setting (2008) 0.04
```
0.035564862 = product of:
  0.09483963 = sum of:
    0.033409793 = weight(_text_:libraries in 2532) [ClassicSimilarity], result of:
      0.033409793 = score(doc=2532,freq=4.0), product of:
        0.13017908 = queryWeight, product of:
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.03962768 = queryNorm
        0.25664487 = fieldWeight in 2532, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2532)
    0.034856133 = weight(_text_:studies in 2532) [ClassicSimilarity], result of:
      0.034856133 = score(doc=2532,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.22043361 = fieldWeight in 2532, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2532)
    0.0265737 = product of:
      0.0531474 = sum of:
        0.0531474 = weight(_text_:area in 2532) [ClassicSimilarity], result of:
          0.0531474 = score(doc=2532,freq=2.0), product of:
            0.1952553 = queryWeight, product of:
              4.927245 = idf(docFreq=870, maxDocs=44218)
              0.03962768 = queryNorm
            0.27219442 = fieldWeight in 2532, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.927245 = idf(docFreq=870, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2532)
      0.5 = coord(1/2)
  0.375 = coord(3/8)
```
Abstract

In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations regarding how to practically apply the KNN algorithm to develop automatic document classification in a library setting are made. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.

Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.03

0.033971757 = product of:
  0.090591356 = sum of:
    0.042312715 = weight(_text_:case in 1107) [ClassicSimilarity], result of:
      0.042312715 = score(doc=1107,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.24286987 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.034856133 = weight(_text_:studies in 1107) [ClassicSimilarity], result of:
      0.034856133 = score(doc=1107,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.22043361 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.013422508 = product of:
      0.026845016 = sum of:
        0.026845016 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.026845016 = score(doc=1107,freq=2.0), product of:
            0.13876937 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03962768 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(1/2)
  0.375 = coord(3/8)

Abstract: Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
Date: 28.10.2013 19:22:57

Smiraglia, R.P.; Cai, X.: Tracking the evolution of clustering, machine learning, automatic indexing and automatic classification in knowledge organization (2017) 0.03
```
0.032367557 = product of:
  0.12947023 = sum of:
    0.0946141 = weight(_text_:case in 3627) [ClassicSimilarity], result of:
      0.0946141 = score(doc=3627,freq=10.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.54307353 = fieldWeight in 3627, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
    0.034856133 = weight(_text_:studies in 3627) [ClassicSimilarity], result of:
      0.034856133 = score(doc=3627,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.22043361 = fieldWeight in 3627, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3627)
  0.25 = coord(2/8)
```
Abstract

A very important extension of the traditional domain of knowledge organization (KO) arises from attempts to incorporate techniques devised in the computer science domain for automatic concept extraction and for grouping, categorizing, clustering and otherwise organizing knowledge using mechanical means. Four specific terms have emerged to identify the most prevalent techniques: machine learning, clustering, automatic indexing, and automatic classification. Our study presents three domain analytical case analyses in search of answers. The first case relies on citations located using the ISKO-supported "Knowledge Organization Bibliography." The second case relies on works in both Web of Science and SCOPUS. Case three applies co-word analysis and citation analysis to the contents of the papers in the present special issue. We observe scholars involved in "clustering" and "automatic classification" who share common thematic emphases. But we have found no coherence, no common activity and no social semantics. We have not found a research front, or a common teleology within the KO domain. We also have found a lively group of authors who have succeeded in submitting papers to this special issue, and their work quite interestingly aligns with the case studies we report. There is an emphasis on KO for information retrieval; there is much work on clustering (which involves conceptual points within texts) and automatic classification (which involves semantic groupings at the meta-document level).

Ingwersen, P.; Wormell, I.: Ranganathan in the perspective of advanced information retrieval (1992) 0.03

0.026374804 = product of:
  0.105499215 = sum of:
    0.037798867 = weight(_text_:libraries in 7695) [ClassicSimilarity], result of:
      0.037798867 = score(doc=7695,freq=2.0), product of:
        0.13017908 = queryWeight, product of:
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.03962768 = queryNorm
        0.29036054 = fieldWeight in 7695, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.0625 = fieldNorm(doc=7695)
    0.06770035 = weight(_text_:case in 7695) [ClassicSimilarity], result of:
      0.06770035 = score(doc=7695,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.3885918 = fieldWeight in 7695, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.0625 = fieldNorm(doc=7695)
  0.25 = coord(2/8)

Abstract: Examines Ranganathan's approach to knowledge organisation and its relevance to intellectual accessibility in libraries. Discusses the current and future developments of his methodology and theories in knowledge-based systems. Topics covered include: semi-automatic classification and structure of thesauri; user-intermediary interactions in information retrieval (IR); semantic value-theory and uncertainty principles in IR; and case grammar.

Borko, H.: Research in computer based classification systems (1985) 0.02
```
0.024284061 = product of:
  0.064757496 = sum of:
    0.016537005 = weight(_text_:libraries in 3647) [ClassicSimilarity], result of:
      0.016537005 = score(doc=3647,freq=2.0), product of:
        0.13017908 = queryWeight, product of:
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.03962768 = queryNorm
        0.12703274 = fieldWeight in 3647, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.02734375 = fieldNorm(doc=3647)
    0.029618902 = weight(_text_:case in 3647) [ClassicSimilarity], result of:
      0.029618902 = score(doc=3647,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.17000891 = fieldWeight in 3647, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.02734375 = fieldNorm(doc=3647)
    0.018601589 = product of:
      0.037203178 = sum of:
        0.037203178 = weight(_text_:area in 3647) [ClassicSimilarity], result of:
          0.037203178 = score(doc=3647,freq=2.0), product of:
            0.1952553 = queryWeight, product of:
              4.927245 = idf(docFreq=870, maxDocs=44218)
              0.03962768 = queryNorm
            0.19053608 = fieldWeight in 3647, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.927245 = idf(docFreq=870, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3647)
      0.5 = coord(1/2)
  0.375 = coord(3/8)
```
Abstract

The selection in this reader by R. M. Needham and K. Sparck Jones reports an early approach to automatic classification that was taken in England. The following selection reviews various approaches that were being pursued in the United States at about the same time. It then discusses a particular approach initiated in the early 1960s by Harold Borko, at that time Head of the Language Processing and Retrieval Research Staff at the System Development Corporation, Santa Monica, California and, since 1966, a member of the faculty at the Graduate School of Library and Information Science, University of California, Los Angeles. As was described earlier, there are two steps in automatic classification, the first being to identify pairs of terms that are similar by virtue of co-occurring as index terms in the same documents, and the second being to form equivalence classes of intersubstitutable terms. To compute similarities, Borko and his associates used a standard correlation formula; to derive classification categories, where Needham and Sparck Jones used clumping, the Borko team used the statistical technique of factor analysis. The fact that documents can be classified automatically, and in any number of ways, is worthy of passing notice. Worthy of serious attention would be a demonstration that a computer-based classification system was effective in the organization and retrieval of documents. One reason for the inclusion of the following selection in the reader is that it addresses the question of evaluation. To evaluate the effectiveness of their automatically derived classification, Borko and his team asked three questions. The first was: Is the classification reliable? In other words, could the categories derived from one sample of texts be used to classify other texts? Reliability was assessed by a case-study comparison of the classes derived from three different samples of abstracts.
The not-so-surprising conclusion reached was that automatically derived classes were reliable only to the extent that the sample from which they were derived was representative of the total document collection. The second evaluation question asked whether the classification was reasonable, in the sense of adequately describing the content of the document collection. The answer was sought by comparing the automatically derived categories with categories in a related classification system that was manually constructed. Here the conclusion was that the automatic method yielded categories that fairly accurately reflected the major area of interest in the sample collection of texts; however, since there were only eleven such categories and they were quite broad, they could not be regarded as suitable for use in a university or any large general library. The third evaluation question asked whether automatic classification was accurate, in the sense of producing results similar to those obtainable by human classifiers. When using human classification as a criterion, automatic classification was found to be 50 percent accurate.
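
The first step described in this abstract, scoring term pairs by how strongly they co-occur as index terms, can be illustrated with a small sketch. The abstract only says "a standard correlation formula"; the Pearson coefficient used here is an assumption, and the terms and the term-document matrix are invented purely for illustration.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical incidence vectors: one entry per document,
# 1 if the index term was assigned to that document.
incidence = {
    "retrieval": [1, 1, 0, 1, 0],
    "indexing":  [1, 1, 0, 1, 0],
    "factor":    [0, 0, 1, 0, 1],
}

same = pearson(incidence["retrieval"], incidence["indexing"])
opposite = pearson(incidence["retrieval"], incidence["factor"])
print(same, opposite)
```

Terms that are always assigned together correlate positively and are candidates for the same class; the second step (factor analysis in the approach described) would then operate on the full matrix of such pairwise values.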

Imprint

Littleton, CO : Libraries Unlimited

Cathey, R.J.; Jensen, E.C.; Beitzel, S.M.; Frieder, O.; Grossman, D.: Exploiting parallelism to support scalable hierarchical clustering (2007) 0.02
```
0.019292213 = product of:
  0.07716885 = sum of:
    0.042312715 = weight(_text_:case in 448) [ClassicSimilarity], result of:
      0.042312715 = score(doc=448,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.24286987 = fieldWeight in 448, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.0390625 = fieldNorm(doc=448)
    0.034856133 = weight(_text_:studies in 448) [ClassicSimilarity], result of:
      0.034856133 = score(doc=448,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.22043361 = fieldWeight in 448, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.0390625 = fieldNorm(doc=448)
  0.25 = coord(2/8)
```
Abstract

A distributed memory parallel version of the group average hierarchical agglomerative clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard Text REtrieval Conference (TREC) test collection, our parallel hierarchical clustering algorithm is shown to be scalable in terms of processors efficiently used and the collection size. Results show that our algorithm performs close to the expected O(n**2/p) time on p processors rather than the worst-case O(n**3/p) time. Furthermore, the O(n**2/p) memory complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning algorithms such as k-means are trivially parallelizable, our results confirm those of other studies which showed that hierarchical algorithms produce significantly tighter clusters in the document clustering task. Finally, we show how our parallel hierarchical agglomerative clustering algorithm can be used as the clustering subroutine for a parallel version of the buckshot algorithm to cluster the complete TREC collection at near theoretical runtime expectations.

Pfeffer, M.: Automatische Vergabe von RVK-Notationen mittels fallbasiertem Schließen (2009) 0.02

0.016720567 = product of:
  0.06688227 = sum of:
    0.05077526 = weight(_text_:case in 3051) [ClassicSimilarity], result of:
      0.05077526 = score(doc=3051,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.29144385 = fieldWeight in 3051, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.046875 = fieldNorm(doc=3051)
    0.01610701 = product of:
      0.03221402 = sum of:
        0.03221402 = weight(_text_:22 in 3051) [ClassicSimilarity], result of:
          0.03221402 = score(doc=3051,freq=2.0), product of:
            0.13876937 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03962768 = queryNorm
            0.23214069 = fieldWeight in 3051, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=3051)
      0.5 = coord(1/2)
  0.25 = coord(2/8)

Date: 22. 8.2009 19:51:28
Theme: Case Based Reasoning

Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.02
```
0.015357459 = product of:
  0.061429836 = sum of:
    0.034856133 = weight(_text_:studies in 1853) [ClassicSimilarity], result of:
      0.034856133 = score(doc=1853,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.22043361 = fieldWeight in 1853, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1853)
    0.0265737 = product of:
      0.0531474 = sum of:
        0.0531474 = weight(_text_:area in 1853) [ClassicSimilarity], result of:
          0.0531474 = score(doc=1853,freq=2.0), product of:
            0.1952553 = queryWeight, product of:
              4.927245 = idf(docFreq=870, maxDocs=44218)
              0.03962768 = queryNorm
            0.27219442 = fieldWeight in 1853, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.927245 = idf(docFreq=870, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
      0.5 = coord(1/2)
  0.25 = coord(2/8)
```
Abstract

In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics (bibliometrics and scientometrics studies) for STW rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) are already attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has rendered necessary the integration of natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted di-graphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), will seek to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.01

0.014483592 = product of:
  0.057934366 = sum of:
    0.04182736 = weight(_text_:studies in 2158) [ClassicSimilarity], result of:
      0.04182736 = score(doc=2158,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.26452032 = fieldWeight in 2158, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.01610701 = product of:
      0.03221402 = sum of:
        0.03221402 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
          0.03221402 = score(doc=2158,freq=2.0), product of:
            0.13876937 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03962768 = queryNorm
            0.23214069 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
      0.5 = coord(1/2)
  0.25 = coord(2/8)

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04

Automatic classification research at OCLC (2002) 0.01

0.01296638 = product of:
  0.05186552 = sum of:
    0.03307401 = weight(_text_:libraries in 1563) [ClassicSimilarity], result of:
      0.03307401 = score(doc=1563,freq=2.0), product of:
        0.13017908 = queryWeight, product of:
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.03962768 = queryNorm
        0.25406548 = fieldWeight in 1563, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2850544 = idf(docFreq=4499, maxDocs=44218)
          0.0546875 = fieldNorm(doc=1563)
    0.018791512 = product of:
      0.037583023 = sum of:
        0.037583023 = weight(_text_:22 in 1563) [ClassicSimilarity], result of:
          0.037583023 = score(doc=1563,freq=2.0), product of:
            0.13876937 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03962768 = queryNorm
            0.2708308 = fieldWeight in 1563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1563)
      0.5 = coord(1/2)
  0.25 = coord(2/8)

Abstract: OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified. Accordingly, OCLC has developed products, sponsored research projects, and encouraged the participation in international standards communities whose outcome has been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
Date: 5. 5.2003 9:22:09

Khoo, C.S.G.; Ng, K.; Ou, S.: An exploratory study of human clustering of Web pages (2003) 0.01
```
0.009655728 = product of:
  0.038622912 = sum of:
    0.027884906 = weight(_text_:studies in 2741) [ClassicSimilarity], result of:
      0.027884906 = score(doc=2741,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.17634688 = fieldWeight in 2741, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03125 = fieldNorm(doc=2741)
    0.010738007 = product of:
      0.021476014 = sum of:
        0.021476014 = weight(_text_:22 in 2741) [ClassicSimilarity], result of:
          0.021476014 = score(doc=2741,freq=2.0), product of:
            0.13876937 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03962768 = queryNorm
            0.15476047 = fieldWeight in 2741, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.03125 = fieldNorm(doc=2741)
      0.5 = coord(1/2)
  0.25 = coord(2/8)
```
Abstract

This study seeks to find out how human beings cluster Web pages naturally. Twenty Web pages retrieved by the Northern Light search engine for each of 10 queries were sorted by 3 subjects into categories that were natural or meaningful to them. It was found that different subjects clustered the same set of Web pages quite differently and created different categories. The average inter-subject similarity of the clusters created was a low 0.27. Subjects created an average of 5.4 clusters for each sorting. The categories constructed can be divided into 10 types. About 1/3 of the categories created were topical. Another 20% of the categories relate to the degree of relevance or usefulness. The rest of the categories were subject-independent categories such as format, purpose, authoritativeness and direction to other sources. The authors plan to develop automatic methods for categorizing Web pages using the common categories created by the subjects. It is hoped that the techniques developed can be used by Web search engines to automatically organize Web pages retrieved into categories that are natural to users.

1. Introduction: The World Wide Web is an increasingly important source of information for people globally because of its ease of access, the ease of publishing, its ability to transcend geographic and national boundaries, its flexibility and heterogeneity and its dynamic nature. However, Web users also find it increasingly difficult to locate relevant and useful information in this vast information storehouse. Web search engines, despite their scope and power, appear to be quite ineffective. They retrieve too many pages, and though they attempt to rank retrieved pages in order of probable relevance, often the relevant documents do not appear in the top-ranked 10 or 20 documents displayed. Several studies have found that users do not know how to use the advanced features of Web search engines, and do not know how to formulate and re-formulate queries. Users also typically exert minimal effort in performing, evaluating and refining their searches, and are unwilling to scan more than 10 or 20 items retrieved (Jansen, Spink, Bateman & Saracevic, 1998). This suggests that the conventional ranked-list display of search results does not satisfy user requirements, and that better ways of presenting and summarizing search results have to be developed. One promising approach is to group retrieved pages into clusters or categories to allow users to navigate immediately to the "promising" clusters where the most useful Web pages are likely to be located. This approach has been adopted by a number of search engines (notably Northern Light) and search agents.

Date

12. 9.2004 9:56:22

Montesi, M.; Navarrete, T.: Classifying web genres in context : A case study documenting the web genres used by a software engineer (2008) 0.01
```
0.008975883 = product of:
  0.071807064 = sum of:
    0.071807064 = weight(_text_:case in 2100) [ClassicSimilarity], result of:
      0.071807064 = score(doc=2100,freq=4.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.41216385 = fieldWeight in 2100, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.046875 = fieldNorm(doc=2100)
  0.125 = coord(1/8)
```
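The arithmetic in the explanation above can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function name `classic_term_score` is hypothetical, and `queryNorm`, `fieldNorm` and `coord` are taken from the listing as given rather than derived.

```python
import math


def classic_term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one matching term, following the tree in the explanation.

    Assumes ClassicSimilarity-style tf and idf; query_norm and field_norm
    are copied from the explain output, not recomputed here.
    """
    tf = math.sqrt(freq)                              # 2.0 for freq=4.0
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))   # about 4.3964
    query_weight = idf * query_norm                   # "queryWeight" node
    field_weight = tf * idf * field_norm              # "fieldWeight" node
    return query_weight * field_weight


# Values read off the explanation for the term "case" in doc 2100.
term = classic_term_score(
    freq=4.0,
    doc_freq=1480,
    max_docs=44218,
    query_norm=0.03962768,
    field_norm=0.046875,
)

# coord(1/8): one of the eight query clauses matched this document.
total = term * (1 / 8)

print(f"term weight = {term:.9f}")
print(f"final score = {total:.9f}")
```

The result should agree with the listed 0.008975883 to several decimal places; small differences in the last digits are expected because Lucene computes in single precision while Python uses doubles.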
Abstract

This case study analyzes the Internet-based resources that a software engineer uses in his daily work. Methodologically, we studied the web browser history of the participant, classifying all the web pages he had seen over a period of 12 days into web genres. We interviewed him before and after the analysis of the web browser history. In the first interview, he spoke about his general information behavior; in the second, he commented on each web genre, explaining why and how he used them. As a result, three approaches allow us to describe the set of 23 web genres obtained: (a) the purposes they serve for the participant; (b) the role they play in the various work and search phases; (c) and the way they are used in combination with each other. Further observations concern the way the participant assesses quality of web-based resources, and his information behavior as a software engineer.

Bianchini, C.; Bargioni, S.: Automated classification using linked open data : a case study on faceted classification and Wikidata (2021) 0.01

0.0074047255 = product of:
  0.059237804 = sum of:
    0.059237804 = weight(_text_:case in 724) [ClassicSimilarity], result of:
      0.059237804 = score(doc=724,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.34001783 = fieldWeight in 724, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.0546875 = fieldNorm(doc=724)
  0.125 = coord(1/8)

Ozmutlu, S.; Cosar, G.C.: Analyzing the results of automatic new topic identification (2008) 0.01
```
0.0073941024 = product of:
  0.05915282 = sum of:
    0.05915282 = weight(_text_:studies in 2604) [ClassicSimilarity], result of:
      0.05915282 = score(doc=2604,freq=4.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.37408823 = fieldWeight in 2604, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.046875 = fieldNorm(doc=2604)
  0.125 = coord(1/8)
```
Abstract

Purpose - Identification of topic changes within a user search session is a key issue in content analysis of search engine user queries. Recently, various studies have focused on new topic identification/session identification of search engine transaction logs, and several problems regarding the estimation of topic shifts and continuations were observed in these studies. This study aims to analyze the reasons for the problems that were encountered as a result of applying automatic new topic identification. Design/methodology/approach - Measures, such as cleaning the data of common words and analyzing the errors of automatic new topic identification, are applied to eliminate the problems in estimating topic shifts and continuations. Findings - The findings show that the resulting errors of automatic new topic identification have a pattern, and further research is required to improve the performance of automatic new topic identification. Originality/value - Improving the performance of automatic new topic identification would be valuable to search engine designers, so that they can develop new clustering and query recommendation algorithms, as well as custom-tailored graphical user interfaces for search engine users.

Yi, K.: Challenges in automated classification using library classification schemes (2006) 0.01

0.0069712265 = product of:
  0.055769812 = sum of:
    0.055769812 = weight(_text_:studies in 5810) [ClassicSimilarity], result of:
      0.055769812 = score(doc=5810,freq=2.0), product of:
        0.15812531 = queryWeight, product of:
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.03962768 = queryNorm
        0.35269377 = fieldWeight in 5810, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.9902744 = idf(docFreq=2222, maxDocs=44218)
          0.0625 = fieldNorm(doc=5810)
  0.125 = coord(1/8)

Abstract: A major library classification scheme has long been the standard classification framework for information sources in the traditional library environment, and text classification (TC) has become a popular and attractive tool for organizing digital information. This paper gives an overview of previous projects and studies on TC using major library classification schemes, and summarizes a discussion of TC research challenges.

Larson, R.R.: Experiments in automatic Library of Congress Classification (1992) 0.01
```
0.0063469075 = product of:
  0.05077526 = sum of:
    0.05077526 = weight(_text_:case in 1054) [ClassicSimilarity], result of:
      0.05077526 = score(doc=1054,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.29144385 = fieldWeight in 1054, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.046875 = fieldNorm(doc=1054)
  0.125 = coord(1/8)
```
Abstract

This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial match retrieval techniques using various elements of new records (i.e., those to be classified) as "queries", and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 86% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.

Pfeffer, M.: Automatische Vergabe von RVK-Notationen anhand von bibliografischen Daten mittels fallbasiertem Schließen (2007) 0.01

0.0063469075 = product of:
  0.05077526 = sum of:
    0.05077526 = weight(_text_:case in 558) [ClassicSimilarity], result of:
      0.05077526 = score(doc=558,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.29144385 = fieldWeight in 558, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.046875 = fieldNorm(doc=558)
  0.125 = coord(1/8)

Theme: Case Based Reasoning

Wu, M.; Liu, Y.-H.; Brownlee, R.; Zhang, X.: Evaluating utility and automatic classification of subject metadata from Research Data Australia (2021) 0.01
```
0.0063469075 = product of:
  0.05077526 = sum of:
    0.05077526 = weight(_text_:case in 453) [ClassicSimilarity], result of:
      0.05077526 = score(doc=453,freq=2.0), product of:
        0.1742197 = queryWeight, product of:
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.03962768 = queryNorm
        0.29144385 = fieldWeight in 453, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.3964143 = idf(docFreq=1480, maxDocs=44218)
          0.046875 = fieldNorm(doc=453)
  0.125 = coord(1/8)
```
Abstract

In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well represented categories of subject metadata, and these categories tend to have features that can distinguish themselves from the other categories. Our findings raise implications for data catalogue providers; they should invest more effort to enhance the quality of data records by providing an adequate description of these records for under-represented subject categories.
