Search (47 results, page 1 of 3)

Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.02

0.02030736 = product of:
  0.05584524 = sum of:
    0.0040582716 = weight(_text_:a in 690) [ClassicSimilarity], result of:
      0.0040582716 = score(doc=690,freq=6.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.13239266 = fieldWeight in 690, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.0020832212 = weight(_text_:s in 690) [ClassicSimilarity], result of:
      0.0020832212 = score(doc=690,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.072074346 = fieldWeight in 690, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.038898204 = weight(_text_:k in 690) [ClassicSimilarity], result of:
      0.038898204 = score(doc=690,freq=6.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.40988132 = fieldWeight in 690, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.046875 = fieldNorm(doc=690)
    0.010805541 = product of:
      0.021611081 = sum of:
        0.021611081 = weight(_text_:22 in 690) [ClassicSimilarity], result of:
          0.021611081 = score(doc=690,freq=2.0), product of:
            0.09309476 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.026584605 = queryNorm
            0.23214069 = fieldWeight in 690, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=690)
      0.5 = coord(1/2)
  0.36363637 = coord(4/11)

Abstract: We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded on singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top-ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between latent semantic indexing (LSI) term subspace and LSI document subspace. LSISSM does feature reduction and finds a low-rank approximation of scalable and sparse term-document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms such as standard K-means and self-organizing maps compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K-means compared with random seeding procedure, which sometimes causes low efficiency and effectiveness of clustering. A two-stage initialization strategy based on LSISSM significantly reduces the running time of standard K-means procedures.
Date: 23. 3.2013 13:22:36
Source: Journal of the American Society for Information Science and Technology. 64(2013) no.4, S.844-860
Type: a

Sojka, P.; Lee, M.; Rehurek, R.; Hatlapatka, R.; Kucbel, M.; Bouche, T.; Goutorbe, C.; Anghelache, R.; Wojciechowski, K.: Toolset for entity and semantic associations : Final Release (2013) 0.02

0.015886089 = product of:
  0.05824899 = sum of:
    0.0023430442 = weight(_text_:a in 1057) [ClassicSimilarity], result of:
      0.0023430442 = score(doc=1057,freq=2.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.07643694 = fieldWeight in 1057, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=1057)
    0.03344806 = weight(_text_:r in 1057) [ClassicSimilarity], result of:
      0.03344806 = score(doc=1057,freq=6.0), product of:
        0.088001914 = queryWeight, product of:
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.026584605 = queryNorm
        0.38008332 = fieldWeight in 1057, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.046875 = fieldNorm(doc=1057)
    0.022457888 = weight(_text_:k in 1057) [ClassicSimilarity], result of:
      0.022457888 = score(doc=1057,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.23664509 = fieldWeight in 1057, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.046875 = fieldNorm(doc=1057)
  0.27272728 = coord(3/11)

Abstract: In this document we describe the final release of the toolset for entity and semantic associations, integrating two versions (language dependent and language independent) of Unsupervised Document Similarity implemented by MU (using gensim tool) and Citation Indexing, Resolution and Matching (UJF/CMD). We give a brief description of tools, the rationale behind decisions made, and provide elementary evaluation. Tools are integrated in the main project result, EuDML website, and they deliver the needed functionality for exploratory searching and browsing the collected documents. EuDML users and content providers thus benefit from millions of algorithmically generated similarity and citation links, developed using state of the art machine learning and matching methods.

Golub, K.; Hansson, J.; Soergel, D.; Tudhope, D.: Managing classification in libraries : a methodological outline for evaluating automatic subject indexing and classification in Swedish library catalogues (2015) 0.02

0.015432358 = product of:
  0.042438984 = sum of:
    0.0055226083 = weight(_text_:a in 2300) [ClassicSimilarity], result of:
      0.0055226083 = score(doc=2300,freq=16.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.18016359 = fieldWeight in 2300, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
    0.0024550997 = weight(_text_:s in 2300) [ClassicSimilarity], result of:
      0.0024550997 = score(doc=2300,freq=4.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.08494043 = fieldWeight in 2300, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
    0.015746372 = weight(_text_:u in 2300) [ClassicSimilarity], result of:
      0.015746372 = score(doc=2300,freq=2.0), product of:
        0.08704981 = queryWeight, product of:
          3.2744443 = idf(docFreq=4547, maxDocs=44218)
          0.026584605 = queryNorm
        0.1808892 = fieldWeight in 2300, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2744443 = idf(docFreq=4547, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
    0.018714907 = weight(_text_:k in 2300) [ClassicSimilarity], result of:
      0.018714907 = score(doc=2300,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.19720423 = fieldWeight in 2300, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2300)
  0.36363637 = coord(4/11)

Abstract: Subject terms play a crucial role in resource discovery but require substantial effort to produce. Automatic subject classification and indexing address problems of scale and sustainability and can be used to enrich existing bibliographic records, establish more connections across and between resources and enhance consistency of bibliographic data. The paper aims to put forward a complex methodological framework to evaluate automatic classification tools of Swedish textual documents based on the Dewey Decimal Classification (DDC) recently introduced to Swedish libraries. Three major complementary approaches are suggested: a quality-built gold standard, retrieval effects, domain analysis. The gold standard is built based on input from at least two catalogue librarians, end-users expert in the subject, end users inexperienced in the subject and automated tools. Retrieval effects are studied through a combination of assigned and free tasks, including factual and comprehensive types. The study also takes into consideration the different role and character of subject terms in various knowledge domains, such as scientific disciplines. As a theoretical framework, domain analysis is used and applied in relation to the implementation of DDC in Swedish libraries and chosen domains of knowledge within the DDC itself.
Location: S
Pages: S.163-175
Source: Classification and authority control: expanding resource discovery: proceedings of the International UDC Seminar 2015, 29-30 October 2015, Lisbon, Portugal. Eds.: Slavic, A. u. M.I. Cordeiro
Type: a

Liu, R.-L.: ¬A passage extractor for classification of disease aspect information (2013) 0.01

0.012217137 = product of:
  0.033597127 = sum of:
    0.0067637865 = weight(_text_:a in 1107) [ClassicSimilarity], result of:
      0.0067637865 = score(doc=1107,freq=24.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.22065444 = fieldWeight in 1107, product of:
          4.8989797 = tf(freq=24.0), with freq of:
            24.0 = termFreq=24.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.016092705 = weight(_text_:r in 1107) [ClassicSimilarity], result of:
      0.016092705 = score(doc=1107,freq=2.0), product of:
        0.088001914 = queryWeight, product of:
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.026584605 = queryNorm
        0.18286766 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.0017360178 = weight(_text_:s in 1107) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=1107,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 1107, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=1107)
    0.009004618 = product of:
      0.018009236 = sum of:
        0.018009236 = weight(_text_:22 in 1107) [ClassicSimilarity], result of:
          0.018009236 = score(doc=1107,freq=2.0), product of:
            0.09309476 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.026584605 = queryNorm
            0.19345059 = fieldWeight in 1107, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1107)
      0.5 = coord(1/2)
  0.36363637 = coord(4/11)

Abstract: Retrieval of disease information is often based on several key aspects such as etiology, diagnosis, treatment, prevention, and symptoms of diseases. Automatic identification of disease aspect information is thus essential. In this article, I model the aspect identification problem as a text classification (TC) problem in which a disease aspect corresponds to a category. The disease aspect classification problem poses two challenges to classifiers: (a) a medical text often contains information about multiple aspects of a disease and hence produces noise for the classifiers and (b) text classifiers often cannot extract the textual parts (i.e., passages) about the categories of interest. I thus develop a technique, PETC (Passage Extractor for Text Classification), that extracts passages (from medical texts) for the underlying text classifiers to classify. Case studies on thousands of Chinese and English medical texts show that PETC enhances a support vector machine (SVM) classifier in classifying disease aspect information. PETC also performs better than three state-of-the-art classifier enhancement techniques, including two passage extraction techniques for text classifiers and a technique that employs term proximity information to enhance text classifiers. The contribution is of significance to evidence-based medicine, health education, and healthcare decision support. PETC can be used in those application domains in which a text to be classified may have several parts about different categories.
Date: 28.10.2013 19:22:57
Source: Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2265-2277
Type: a

Alberts, I.; Forest, D.: Email pragmatics and automatic classification : a study in the organizational context (2012) 0.01
```
0.0107228495 = product of:
  0.039317112 = sum of:
    0.0051659266 = weight(_text_:a in 238) [ClassicSimilarity], result of:
      0.0051659266 = score(doc=238,freq=14.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.1685276 = fieldWeight in 238, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=238)
    0.0017360178 = weight(_text_:s in 238) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=238,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 238, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=238)
    0.032415166 = weight(_text_:k in 238) [ClassicSimilarity], result of:
      0.032415166 = score(doc=238,freq=6.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.34156775 = fieldWeight in 238, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=238)
  0.27272728 = coord(3/11)
```
Abstract

This paper presents a two-phased research project aiming to improve email triage for public administration managers. The first phase developed a typology of email classification patterns through a qualitative study involving 34 participants. Inspired by the fields of pragmatics and speech act theory, this typology comprising four top level categories and 13 subcategories represents the typical email triage behaviors of managers in an organizational context. The second study phase was conducted on a corpus of 1,703 messages using email samples of two managers. Using the k-NN (k-nearest neighbor) algorithm, statistical treatments automatically classified the email according to lexical and nonlexical features representative of managers' triage patterns. The automatic classification of email according to the lexicon of the messages was found to be substantially more efficient when k = 2 and n = 2,000. For four categories, the average recall rate was 94.32%, the average precision rate was 94.50%, and the accuracy rate was 94.54%. For 13 categories, the average recall rate was 91.09%, the average precision rate was 84.18%, and the accuracy rate was 88.70%. It appears that a message's nonlexical features are also deeply influenced by email pragmatics. Features related to the recipient and the sender were the most relevant for characterizing email.

Source

Journal of the American Society for Information Science and Technology. 63(2012) no.5, S.904-922

Type

a

Golub, K.: Automated subject classification of textual documents in the context of Web-based hierarchical browsing (2011) 0.01

0.008121905 = product of:
  0.029780317 = sum of:
    0.0052392064 = weight(_text_:a in 4558) [ClassicSimilarity], result of:
      0.0052392064 = score(doc=4558,freq=10.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.1709182 = fieldWeight in 4558, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
    0.0020832212 = weight(_text_:s in 4558) [ClassicSimilarity], result of:
      0.0020832212 = score(doc=4558,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.072074346 = fieldWeight in 4558, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
    0.022457888 = weight(_text_:k in 4558) [ClassicSimilarity], result of:
      0.022457888 = score(doc=4558,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.23664509 = fieldWeight in 4558, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.046875 = fieldNorm(doc=4558)
  0.27272728 = coord(3/11)

Abstract: While automated methods for information organization have been around for several decades now, exponential growth of the World Wide Web has put them into the forefront of research in different communities, within which several approaches can be identified: 1) machine learning (algorithms that allow computers to improve their performance based on learning from pre-existing data); 2) document clustering (algorithms for unsupervised document organization and automated topic extraction); and 3) string matching (algorithms that match given strings within larger text). Here the aim was to automatically organize textual documents into hierarchical structures for subject browsing. The string-matching approach was tested using a controlled vocabulary (containing pre-selected and pre-defined authorized terms, each corresponding to only one concept). The results imply that an appropriate controlled vocabulary, with a sufficient number of entry terms designating classes, could in itself be a solution for automated classification. Then, if the same controlled vocabulary had an appropriat hierarchical structure, it would at the same time provide a good browsing structure for the collection of automatically classified documents.
Source: Knowledge organization. 38(2011) no.3, S.230-244
Type: a

Kishida, K.: High-speed rough clustering for very large document collections (2010) 0.01

0.0070836907 = product of:
  0.025973532 = sum of:
    0.0055226083 = weight(_text_:a in 3463) [ClassicSimilarity], result of:
      0.0055226083 = score(doc=3463,freq=16.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.18016359 = fieldWeight in 3463, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3463)
    0.0017360178 = weight(_text_:s in 3463) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=3463,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 3463, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3463)
    0.018714907 = weight(_text_:k in 3463) [ClassicSimilarity], result of:
      0.018714907 = score(doc=3463,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.19720423 = fieldWeight in 3463, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3463)
  0.27272728 = coord(3/11)

Abstract: Document clustering is an important tool, but it is not yet widely used in practice probably because of its high computational complexity. This article explores techniques of high-speed rough clustering of documents, assuming that it is sometimes necessary to obtain a clustering result in a shorter time, although the result is just an approximate outline of document clusters. A promising approach for such clustering is to reduce the number of documents to be checked for generating cluster vectors in the leader-follower clustering algorithm. Based on this idea, the present article proposes a modified Crouch algorithm and incomplete single-pass leader-follower algorithm. Also, a two-stage grouping technique, in which the first stage attempts to decrease the number of documents to be processed in the second stage by applying a quick merging technique, is developed. An experiment using a part of the Reuters corpus RCV1 showed empirically that both the modified Crouch and the incomplete single-pass leader-follower algorithms achieve clustering results more efficiently than the original methods, and also improved the effectiveness of clustering results. On the other hand, the two-stage grouping technique did not reduce the processing time in this experiment.
Source: Journal of the American Society for Information Science and Technology. 61(2010) no.6, S.1092-1104
Type: a

Golub, K.; Soergel, D.; Buchanan, G.; Tudhope, D.; Lykke, M.; Hiom, D.: ¬A framework for evaluating automatic indexing or classification in the context of retrieval (2016) 0.01

0.006986414 = product of:
  0.02561685 = sum of:
    0.0051659266 = weight(_text_:a in 3311) [ClassicSimilarity], result of:
      0.0051659266 = score(doc=3311,freq=14.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.1685276 = fieldWeight in 3311, product of:
          3.7416575 = tf(freq=14.0), with freq of:
            14.0 = termFreq=14.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
    0.0017360178 = weight(_text_:s in 3311) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=3311,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 3311, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
    0.018714907 = weight(_text_:k in 3311) [ClassicSimilarity], result of:
      0.018714907 = score(doc=3311,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.19720423 = fieldWeight in 3311, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3311)
  0.27272728 = coord(3/11)

Abstract: Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. Although some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The article reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.
Source: Journal of the Association for Information Science and Technology. 67(2016) no.1, S.3-16
Type: a

Liu, R.-L.: Context-based term frequency assessment for text classification (2010) 0.01

0.006941656 = product of:
  0.025452739 = sum of:
    0.0040582716 = weight(_text_:a in 3331) [ClassicSimilarity], result of:
      0.0040582716 = score(doc=3331,freq=6.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.13239266 = fieldWeight in 3331, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=3331)
    0.019311246 = weight(_text_:r in 3331) [ClassicSimilarity], result of:
      0.019311246 = score(doc=3331,freq=2.0), product of:
        0.088001914 = queryWeight, product of:
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.026584605 = queryNorm
        0.2194412 = fieldWeight in 3331, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.046875 = fieldNorm(doc=3331)
    0.0020832212 = weight(_text_:s in 3331) [ClassicSimilarity], result of:
      0.0020832212 = score(doc=3331,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.072074346 = fieldWeight in 3331, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.046875 = fieldNorm(doc=3331)
  0.27272728 = coord(3/11)

Abstract: Automatic text classification (TC) is essential for the management of information. To properly classify a document d, it is essential to identify the semantics of each term t in d, while the semantics heavily depend on context (neighboring terms) of t in d. Therefore, we present a technique CTFA (Context-based Term Frequency Assessment) that improves text classifiers by considering term contexts in test documents. The results of the term context recognition are used to assess term frequencies of terms, and hence CTFA may easily work with various kinds of text classifiers that base their TC decisions on term frequencies, without needing to modify the classifiers. Moreover, CTFA is efficient, and neither huge memory nor domain-specific knowledge is required. Empirical results show that CTFA successfully enhances performance of several kinds of text classifiers on different experimental data.
Source: Journal of the American Society for Information Science and Technology. 61(2010) no.2, S.300-309
Type: a

Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.01

0.006941656 = product of:
  0.025452739 = sum of:
    0.0040582716 = weight(_text_:a in 1071) [ClassicSimilarity], result of:
      0.0040582716 = score(doc=1071,freq=6.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.13239266 = fieldWeight in 1071, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=1071)
    0.019311246 = weight(_text_:r in 1071) [ClassicSimilarity], result of:
      0.019311246 = score(doc=1071,freq=2.0), product of:
        0.088001914 = queryWeight, product of:
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.026584605 = queryNorm
        0.2194412 = fieldWeight in 1071, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.046875 = fieldNorm(doc=1071)
    0.0020832212 = weight(_text_:s in 1071) [ClassicSimilarity], result of:
      0.0020832212 = score(doc=1071,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.072074346 = fieldWeight in 1071, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.046875 = fieldNorm(doc=1071)
  0.27272728 = coord(3/11)

Abstract: This paper aims to provide an overview of automatic classification research, which focuses on issues related to the automatic classification of documents in a library environment. The review covers literature published in mainstream library and information science studies. The review was done on literature published in both academic and professional LIS journals and other documents. This review reveals that basically three types of research are being done on automatic classification: 1) hierarchical classification using different library classification schemes, 2) text categorization and document categorization using different type of classifiers with or without using training documents, and 3) automatic bibliographic classification. Predominantly this research is directed towards solving problems of organization of digital documents in an online environment. However, very little research is devoted towards solving the problems of arrangement of physical documents.
Source: Knowledge organization. 40(2013) no.5, S.295-304
Type: a

HaCohen-Kerner, Y. et al.: Classification using various machine learning methods and combinations of key-phrases and visual features (2016) 0.01

0.006923549 = product of:
  0.025386345 = sum of:
    0.0039050733 = weight(_text_:a in 2748) [ClassicSimilarity], result of:
      0.0039050733 = score(doc=2748,freq=2.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.12739488 = fieldWeight in 2748, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.078125 = fieldNorm(doc=2748)
    0.0034720355 = weight(_text_:s in 2748) [ClassicSimilarity], result of:
      0.0034720355 = score(doc=2748,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.120123915 = fieldWeight in 2748, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.078125 = fieldNorm(doc=2748)
    0.018009236 = product of:
      0.036018472 = sum of:
        0.036018472 = weight(_text_:22 in 2748) [ClassicSimilarity], result of:
          0.036018472 = score(doc=2748,freq=2.0), product of:
            0.09309476 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.026584605 = queryNorm
            0.38690117 = fieldWeight in 2748, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.078125 = fieldNorm(doc=2748)
      0.5 = coord(1/2)
  0.27272728 = coord(3/11)

Date: 1. 2.2016 18:25:22
Pages: S.64-75
Type: a

Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.01

0.006881903 = product of:
  0.025233643 = sum of:
    0.004782719 = weight(_text_:a in 4101) [ClassicSimilarity], result of:
      0.004782719 = score(doc=4101,freq=12.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.15602624 = fieldWeight in 4101, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4101)
    0.0017360178 = weight(_text_:s in 4101) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=4101,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 4101, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4101)
    0.018714907 = weight(_text_:k in 4101) [ClassicSimilarity], result of:
      0.018714907 = score(doc=4101,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.19720423 = fieldWeight in 4101, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=4101)
  0.27272728 = coord(3/11)

Abstract: Hierarchical text classification (HTC) approaches have recently attracted a lot of interest on the part of researchers in human language technology and machine learning, since they have been shown to bring about equal, if not better, classification accuracy with respect to their "flat" counterparts while allowing exponential time savings at both learning and classification time. A typical component of HTC methods is a "local" policy for selecting negative examples: Given a category c, its negative training examples are by default identified with the training examples that are negative for c and positive for the categories which are siblings of c in the hierarchy. However, this policy has always been taken for granted and never been subjected to careful scrutiny since first proposed 15 years ago. This article proposes a thorough experimental comparison between this policy and three other policies for the selection of negative examples in HTC contexts, one of which (BEST LOCAL (k)) is being proposed for the first time in this article. We compare these policies on the hierarchical versions of three supervised learning algorithms (boosting, support vector machines, and naïve Bayes) by performing experiments on two standard TC datasets, REUTERS-21578 and RCV1-V2.
Source: Journal of the American Society for Information Science and Technology. 61(2010) no.11, S.2256-2265
Type: a

Ma, Z.; Sun, A.; Cong, G.: On predicting the popularity of newly emerging hashtags in Twitter (2013) 0.01

0.006881903 = product of:
  0.025233643 = sum of:
    0.004782719 = weight(_text_:a in 967) [ClassicSimilarity], result of:
      0.004782719 = score(doc=967,freq=12.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.15602624 = fieldWeight in 967, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=967)
    0.0017360178 = weight(_text_:s in 967) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=967,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 967, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=967)
    0.018714907 = weight(_text_:k in 967) [ClassicSimilarity], result of:
      0.018714907 = score(doc=967,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.19720423 = fieldWeight in 967, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=967)
  0.27272728 = coord(3/11)

Abstract: Because of Twitter's popularity and the viral nature of information dissemination on Twitter, predicting which Twitter topics will become popular in the near future becomes a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., Naïve bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter data set consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs the best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features.
Source: Journal of the American Society for Information Science and Technology. 64(2013) no.7, S.1399-1410
Type: a

Yang, P.; Gao, W.; Tan, Q.; Wong, K.-F.: ¬A link-bridged topic model for cross-domain document classification (2013) 0.01
```
0.006768254 = product of:
  0.02481693 = sum of:
    0.0043660053 = weight(_text_:a in 2706) [ClassicSimilarity], result of:
      0.0043660053 = score(doc=2706,freq=10.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.14243183 = fieldWeight in 2706, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2706)
    0.0017360178 = weight(_text_:s in 2706) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=2706,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 2706, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2706)
    0.018714907 = weight(_text_:k in 2706) [ClassicSimilarity], result of:
      0.018714907 = score(doc=2706,freq=2.0), product of:
        0.09490114 = queryWeight, product of:
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.026584605 = queryNorm
        0.19720423 = fieldWeight in 2706, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.569778 = idf(docFreq=3384, maxDocs=44218)
          0.0390625 = fieldNorm(doc=2706)
  0.27272728 = coord(3/11)
```
Abstract

Transfer learning utilizes labeled data available from some related domain (source domain) for achieving effective knowledge transformation to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationship existing among the documents. In this paper, we propose a novel cross-domain document classification approach called Link-Bridged Topic model (LBT). LBT consists of two key steps. Firstly, LBT utilizes an auxiliary link network to discover the direct or indirect co-citation relationship among documents by embedding the background knowledge into a graph kernel. The mined co-citation relationship is leveraged to bridge the gap across different domains. Secondly, LBT simultaneously combines the content information and link structures into a unified latent topic model. The model is based on an assumption that the documents of source and target domains share some common topics from the point of view of both content information and link structure. By mapping both domains data into the latent topic spaces, LBT encodes the knowledge about domain commonality and difference as the shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as content and link statistics. Then the shared topics act as the bridge to facilitate knowledge transfer from the source to the target domains. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.

Source

Information processing and management. 49(2013) no.6, S.1181-1193

Type

a
Ru, C.; Tang, J.; Li, S.; Xie, S.; Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction (2018) 0.01
```
0.006418899 = product of:
  0.023535963 = sum of:
    0.004782719 = weight(_text_:a in 5055) [ClassicSimilarity], result of:
      0.004782719 = score(doc=5055,freq=12.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.15602624 = fieldWeight in 5055, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
    0.003006871 = weight(_text_:s in 5055) [ClassicSimilarity], result of:
      0.003006871 = score(doc=5055,freq=6.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.10403037 = fieldWeight in 5055, product of:
          2.4494898 = tf(freq=6.0), with freq of:
            6.0 = termFreq=6.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
    0.015746372 = weight(_text_:u in 5055) [ClassicSimilarity], result of:
      0.015746372 = score(doc=5055,freq=2.0), product of:
        0.08704981 = queryWeight, product of:
          3.2744443 = idf(docFreq=4547, maxDocs=44218)
          0.026584605 = queryNorm
        0.1808892 = fieldWeight in 5055, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2744443 = idf(docFreq=4547, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5055)
  0.27272728 = coord(3/11)
```
Abstract

Distant supervision (DS) has the advantage of automatically generating large amounts of labelled training data and has been widely used for relation extraction. However, there are usually many wrong labels in the automatically labelled data in distant supervision (Riedel, Yao, & McCallum, 2010). This paper presents a novel method to reduce the wrong labels. The proposed method uses the semantic Jaccard with word embedding to measure the semantic similarity between the relation phrase in the knowledge base and the dependency phrases between two entities in a sentence to filter the wrong labels. In the process of reducing wrong labels, the semantic Jaccard algorithm selects a core dependency phrase to represent the candidate relation in a sentence, which can capture features for relation classification and avoid the negative impact from irrelevant term sequences that previous neural network models of relation extraction often suffer. In the process of relation classification, the core dependency phrases are also used as the input of a convolutional neural network (CNN) for relation classification. The experimental results show that compared with the methods using original DS data, the methods using filtered DS data performed much better in relation extraction. It indicates that the semantic similarity based method is effective in reducing wrong labels. The relation extraction performance of the CNN model using the core dependency phrases as input is the best of all, which indicates that using the core dependency phrases as input of CNN is enough to capture the features for relation classification and could avoid negative impact from irrelevant terms.

Source

Information processing and management. 54(2018) no.4, S.593-608

Theme

Semantisches Umfeld in Indexierung u. Retrieval

Type

a
Yilmaz, T.; Ozcan, R.; Altingovde, I.S.; Ulusoy, Ö.: Improving educational web search for question-like queries through subject classification (2019) 0.01
```
0.0060531073 = product of:
  0.022194726 = sum of:
    0.0043660053 = weight(_text_:a in 5041) [ClassicSimilarity], result of:
      0.0043660053 = score(doc=5041,freq=10.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.14243183 = fieldWeight in 5041, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
    0.016092705 = weight(_text_:r in 5041) [ClassicSimilarity], result of:
      0.016092705 = score(doc=5041,freq=2.0), product of:
        0.088001914 = queryWeight, product of:
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.026584605 = queryNorm
        0.18286766 = fieldWeight in 5041, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.3102584 = idf(docFreq=4387, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
    0.0017360178 = weight(_text_:s in 5041) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=5041,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 5041, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=5041)
  0.27272728 = coord(3/11)
```
Abstract

Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend; social community question answering websites are the second choice for students who try to get answers from other peers online. We attempt discovering possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.

Source

Information processing and management. 56(2019) no.1, S.228-246

Type

a

Egbert, J.; Biber, D.; Davies, M.: Developing a bottom-up, user-based method of web register classification (2015) 0.01

0.00563448 = product of:
  0.02065976 = sum of:
    0.007770999 = weight(_text_:a in 2158) [ClassicSimilarity], result of:
      0.007770999 = score(doc=2158,freq=22.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.25351265 = fieldWeight in 2158, product of:
          4.690416 = tf(freq=22.0), with freq of:
            22.0 = termFreq=22.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.0020832212 = weight(_text_:s in 2158) [ClassicSimilarity], result of:
      0.0020832212 = score(doc=2158,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.072074346 = fieldWeight in 2158, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.046875 = fieldNorm(doc=2158)
    0.010805541 = product of:
      0.021611081 = sum of:
        0.021611081 = weight(_text_:22 in 2158) [ClassicSimilarity], result of:
          0.021611081 = score(doc=2158,freq=2.0), product of:
            0.09309476 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.026584605 = queryNorm
            0.23214069 = fieldWeight in 2158, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.046875 = fieldNorm(doc=2158)
      0.5 = coord(1/2)
  0.27272728 = coord(3/11)

Abstract: This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.
Date: 4. 8.2015 19:22:04
Source: Journal of the Association for Information Science and Technology. 66(2015) no.9, S.1817-1831
Type: a

Piros, A.: Automatic interpretation of complex UDC numbers : towards support for library systems (2015) 0.01
```
0.005019272 = product of:
  0.018403998 = sum of:
    0.0044180867 = weight(_text_:a in 2301) [ClassicSimilarity], result of:
      0.0044180867 = score(doc=2301,freq=16.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.14413087 = fieldWeight in 2301, product of:
          4.0 = tf(freq=16.0), with freq of:
            16.0 = termFreq=16.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.03125 = fieldNorm(doc=2301)
    0.0013888142 = weight(_text_:s in 2301) [ClassicSimilarity], result of:
      0.0013888142 = score(doc=2301,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.048049565 = fieldWeight in 2301, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.03125 = fieldNorm(doc=2301)
    0.012597097 = weight(_text_:u in 2301) [ClassicSimilarity], result of:
      0.012597097 = score(doc=2301,freq=2.0), product of:
        0.08704981 = queryWeight, product of:
          3.2744443 = idf(docFreq=4547, maxDocs=44218)
          0.026584605 = queryNorm
        0.14471136 = fieldWeight in 2301, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2744443 = idf(docFreq=4547, maxDocs=44218)
          0.03125 = fieldNorm(doc=2301)
  0.27272728 = coord(3/11)
```
Abstract

Analytico-synthetic and faceted classifications, such as Universal Decimal Classification (UDC) express content of documents with complex, pre-combined classification codes. Without classification authority control that would help manage and access structured notations, the use of UDC codes in searching and browsing is limited. Existing UDC parsing solutions are usually created for a particular database system or a specific task and are not widely applicable. The approach described in this paper provides a solution by which the analysis and interpretation of UDC notations would be stored into an intermediate format (in this case, in XML) by automatic means without any data or information loss. Due to its richness, the output file can be converted into different formats, such as standard mark-up and data exchange formats or simple lists of the recommended entry points of a UDC number. The program can also be used to create authority records containing complex UDC numbers which can be comprehensively analysed in order to be retrieved effectively. The Java program, as well as the corresponding schema definition it employs, is under continuous development. The current version of the interpreter software is now available online for testing purposes at the following web site: http://interpreter-eto.rhcloud.com. The future plan is to implement conversion methods for standard formats and to create standard online interfaces in order to make it possible to use the features of software as a service. This would result in the algorithm being able to be employed both in existing and future library systems to analyse UDC numbers without any significant programming effort.

Pages

S.177-194

Source

Classification and authority control: expanding resource discovery: proceedings of the International UDC Seminar 2015, 29-30 October 2015, Lisbon, Portugal. Eds.: Slavic, A. u. M.I. Cordeiro

Type

a

Teich, E.; Degaetano-Ortlieb, S.; Fankhauser, P.; Kermes, H.; Lapshinova-Koltunski, E.: ¬The linguistic construal of disciplinarity : a data-mining approach using register features (2016) 0.00

0.003564867 = product of:
  0.013071178 = sum of:
    0.0054389704 = product of:
      0.010877941 = sum of:
        0.010877941 = weight(_text_:h in 3015) [ClassicSimilarity], result of:
          0.010877941 = score(doc=3015,freq=2.0), product of:
            0.0660481 = queryWeight, product of:
              2.4844491 = idf(docFreq=10020, maxDocs=44218)
              0.026584605 = queryNorm
            0.16469726 = fieldWeight in 3015, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.4844491 = idf(docFreq=10020, maxDocs=44218)
              0.046875 = fieldNorm(doc=3015)
      0.5 = coord(1/2)
    0.0046860883 = weight(_text_:a in 3015) [ClassicSimilarity], result of:
      0.0046860883 = score(doc=3015,freq=8.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.15287387 = fieldWeight in 3015, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.046875 = fieldNorm(doc=3015)
    0.00294612 = weight(_text_:s in 3015) [ClassicSimilarity], result of:
      0.00294612 = score(doc=3015,freq=4.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.101928525 = fieldWeight in 3015, product of:
          2.0 = tf(freq=4.0), with freq of:
            4.0 = termFreq=4.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.046875 = fieldNorm(doc=3015)
  0.27272728 = coord(3/11)

Abstract: We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use-both individually and collectively-over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
Source: Journal of the Association for Information Science and Technology. 67(2016) no.7, S.1668-1678
Type: a

Suominen, A.; Toivanen, H.: Map of science with topic modeling : comparison of unsupervised learning and human-assigned subject classification (2016) 0.00

0.003013967 = product of:
  0.011051212 = sum of:
    0.004532476 = product of:
      0.009064952 = sum of:
        0.009064952 = weight(_text_:h in 3121) [ClassicSimilarity], result of:
          0.009064952 = score(doc=3121,freq=2.0), product of:
            0.0660481 = queryWeight, product of:
              2.4844491 = idf(docFreq=10020, maxDocs=44218)
              0.026584605 = queryNorm
            0.13724773 = fieldWeight in 3121, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              2.4844491 = idf(docFreq=10020, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3121)
      0.5 = coord(1/2)
    0.004782719 = weight(_text_:a in 3121) [ClassicSimilarity], result of:
      0.004782719 = score(doc=3121,freq=12.0), product of:
        0.030653298 = queryWeight, product of:
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.026584605 = queryNorm
        0.15602624 = fieldWeight in 3121, product of:
          3.4641016 = tf(freq=12.0), with freq of:
            12.0 = termFreq=12.0
          1.153047 = idf(docFreq=37942, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3121)
    0.0017360178 = weight(_text_:s in 3121) [ClassicSimilarity], result of:
      0.0017360178 = score(doc=3121,freq=2.0), product of:
        0.028903782 = queryWeight, product of:
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.026584605 = queryNorm
        0.060061958 = fieldWeight in 3121, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.0872376 = idf(docFreq=40523, maxDocs=44218)
          0.0390625 = fieldNorm(doc=3121)
  0.27272728 = coord(3/11)

Abstract: The delineation of coordinates is fundamental for the cartography of science, and accurate and credible classification of scientific knowledge presents a persistent challenge in this regard. We present a map of Finnish science based on unsupervised-learning classification, and discuss the advantages and disadvantages of this approach vis-à-vis those generated by human reasoning. We conclude that from theoretical and practical perspectives there exist several challenges for human reasoning-based classification frameworks of scientific knowledge, as they typically try to fit new-to-the-world knowledge into historical models of scientific knowledge, and cannot easily be deployed for new large-scale data sets. Automated classification schemes, in contrast, generate classification models only from the available text corpus, thereby identifying credibly novel bodies of knowledge. They also lend themselves to versatile large-scale data analysis, and enable a range of Big Data possibilities. However, we also argue that it is neither possible nor fruitful to declare one or another method a superior approach in terms of realism to classify scientific knowledge, and we believe that the merits of each approach are dependent on the practical objectives of analysis.
Source: Journal of the Association for Information Science and Technology. 67(2016) no.10, S.2464-2476
Type: a

Search (47 results, page 1 of 3)

Authors

Languages

Types

Themes