Search (245 results, page 2 of 13)

  • × theme_ss:"Automatisches Indexieren"
  • × type_ss:"a"
  1. Damerau, F.J.: Generating an evaluating domain-oriented multi-word terms from texts (1993) 0.01
    0.01254489 = product of:
      0.031362224 = sum of:
        0.0133040715 = product of:
          0.066520356 = sum of:
            0.066520356 = weight(_text_:problem in 5814) [ClassicSimilarity], result of:
              0.066520356 = score(doc=5814,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.375163 = fieldWeight in 5814, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5814)
          0.2 = coord(1/5)
        0.018058153 = weight(_text_:of in 5814) [ClassicSimilarity], result of:
          0.018058153 = score(doc=5814,freq=8.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.27643585 = fieldWeight in 5814, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0625 = fieldNorm(doc=5814)
      0.4 = coord(2/5)
    
    Abstract
    Examines techniques for automatically generating domain vocabularies from large text collections. Focuses on the problem of generating multi-word vocabulary terms (specifically pairs). Discusses statistical issues associated with word co-occurrences likely to be of use in a natural language interface. Provides a more objective evaluation of the selection procedures. As substantial experimentation with subjects using a working query system is absent, all evaluation is necessarily subjective. Uses surrogate for experimentation by relying on pre-existing dictionaries as indicators of domain relevance
  2. Wolfekuhler, M.R.; Punch, W.F.: Finding salient features for personal Web pages categories (1997) 0.01
    0.01239295 = product of:
      0.030982375 = sum of:
        0.011172912 = weight(_text_:of in 2673) [ClassicSimilarity], result of:
          0.011172912 = score(doc=2673,freq=4.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.17103596 = fieldWeight in 2673, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2673)
        0.019809462 = product of:
          0.039618924 = sum of:
            0.039618924 = weight(_text_:22 in 2673) [ClassicSimilarity], result of:
              0.039618924 = score(doc=2673,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.2708308 = fieldWeight in 2673, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2673)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Examines techniques that discover features in sets of pre-categorized documents, such that similar documents can be found on the WWW. Examines techniques which will classifiy training examples with high accuracy, then explains why this is not necessarily useful. Describes a method for extracting word clusters from the raw document features. Results show that the clustering technique is successful in discovering word groups in personal Web pages which can be used to find similar information on the WWW
    Date
    1. 8.1996 22:08:06
    Footnote
    Contribution to a special issue of papers from the 6th International World Wide Web conference, held 7-11 Apr 1997, Santa Clara, California
  3. Zhang, Y.; Zhang, C.; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction (2020) 0.01
    0.011841811 = product of:
      0.029604528 = sum of:
        0.0117592495 = product of:
          0.058796246 = sum of:
            0.058796246 = weight(_text_:problem in 5816) [ClassicSimilarity], result of:
              0.058796246 = score(doc=5816,freq=4.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.33160037 = fieldWeight in 5816, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5816)
          0.2 = coord(1/5)
        0.017845279 = weight(_text_:of in 5816) [ClassicSimilarity], result of:
          0.017845279 = score(doc=5816,freq=20.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.27317715 = fieldWeight in 5816, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5816)
      0.4 = coord(2/5)
    
    Abstract
    Millions of messages are produced on microblog platforms every day, leading to the pressing need for automatic identification of key points from the massive texts. To absorb salient content from the vast bulk of microblog posts, this article focuses on the task of microblog keyphrase extraction. In previous work, most efforts treat messages as independent documents and might suffer from the data sparsity problem exhibited in short and informal microblog posts. On the contrary, we propose to enrich contexts via exploiting conversations initialized by target posts and formed by their replies, which are generally centered around relevant topics to the target posts and therefore helpful for keyphrase identification. Concretely, we present a neural keyphrase extraction framework, which has 2 modules: a conversation context encoder and a keyphrase tagger. The conversation context encoder captures indicative representation from their conversation contexts and feeds the representation into the keyphrase tagger, and the keyphrase tagger extracts salient words from target posts. The 2 modules were trained jointly to optimize the conversation context encoding and keyphrase extraction processes. In the conversation context encoder, we leverage hierarchical structures to capture the word-level indicative representation and message-level indicative representation hierarchically. In both of the modules, we apply character-level representations, which enables the model to explore morphological features and deal with the out-of-vocabulary problem caused by the informal language style of microblog messages. Extensive comparison results on real-life data sets indicate that our model outperforms state-of-the-art models from previous studies.
    Source
    Journal of the Association for Information Science and Technology. 71(2020) no.5, S.553-567
  4. Bloomfield, M.: Indexing : neglected and poorly understood (2001) 0.01
    0.0116526475 = product of:
      0.029131617 = sum of:
        0.009978054 = product of:
          0.04989027 = sum of:
            0.04989027 = weight(_text_:problem in 5439) [ClassicSimilarity], result of:
              0.04989027 = score(doc=5439,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.28137225 = fieldWeight in 5439, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5439)
          0.2 = coord(1/5)
        0.019153563 = weight(_text_:of in 5439) [ClassicSimilarity], result of:
          0.019153563 = score(doc=5439,freq=16.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.2932045 = fieldWeight in 5439, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=5439)
      0.4 = coord(2/5)
    
    Abstract
    The growth of the Internet has highlighted the use of machine indexing. The difficulties in using the Internet as a searching device can be frustrating. The use of the term "Python" is given as an example. Machine indexing is noted as "rotten" and human indexing as "capricious." The problem seems to be a lack of a theoretical foundation for the art of indexing. What librarians have learned over the last hundred years has yet to yield a consistent approach to what really works best in preparing index terms and in the ability of our customers to search the various indexes. An attempt is made to consider the elements of indexing, their pros and cons. The argument is made that machine indexing is far too prolific in its production of index terms. Neither librarians nor computer programmers have made much progress to improve Internet indexing. Human indexing has had the same problems for over fifty years.
  5. Salton, G.: Automatic processing of foreign language documents (1985) 0.01
    0.011507466 = product of:
      0.028768666 = sum of:
        0.0066520358 = product of:
          0.033260178 = sum of:
            0.033260178 = weight(_text_:problem in 3650) [ClassicSimilarity], result of:
              0.033260178 = score(doc=3650,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.1875815 = fieldWeight in 3650, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3650)
          0.2 = coord(1/5)
        0.02211663 = weight(_text_:of in 3650) [ClassicSimilarity], result of:
          0.02211663 = score(doc=3650,freq=48.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.33856338 = fieldWeight in 3650, product of:
              6.928203 = tf(freq=48.0), with freq of:
                48.0 = termFreq=48.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.03125 = fieldNorm(doc=3650)
      0.4 = coord(2/5)
    
    Abstract
    The attempt to computerize a process, such as indexing, abstracting, classifying, or retrieving information, begins with an analysis of the process into its intellectual and nonintellectual components. That part of the process which is amenable to computerization is mechanical or algorithmic. What is not is intellectual or creative and requires human intervention. Gerard Salton has been an innovator, experimenter, and promoter in the area of mechanized information systems since the early 1960s. He has been particularly ingenious at analyzing the process of information retrieval into its algorithmic components. He received a doctorate in applied mathematics from Harvard University before moving to the computer science department at Cornell, where he developed a prototype automatic retrieval system called SMART. Working with this system he and his students contributed for over a decade to our theoretical understanding of the retrieval process. On a more practical level, they have contributed design criteria for operating retrieval systems. The following selection presents one of the early descriptions of the SMART system; it is valuable as it shows the direction automatic retrieval methods were to take beyond simple word-matching techniques. These include various word normalization techniques to improve recall, for instance, the separation of words into stems and affixes; the correlation and clustering, using statistical association measures, of related terms; and the identification, using a concept thesaurus, of synonymous, broader, narrower, and sibling terms. They include, as weIl, techniques, both linguistic and statistical, to deal with the thorny problem of how to automatically extract from texts index terms that consist of more than one word. They include weighting techniques and various documentrequest matching algorithms. Significant among the latter are those which produce a retrieval output of citations ranked in relevante order. During the 1970s, Salton and his students went an to further refine these various techniques, particularly the weighting and statistical association measures. Many of their early innovations seem commonplace today. Some of their later techniques are still ahead of their time and await technological developments for implementation. The particular focus of the selection that follows is an the evaluation of a particular component of the SMART system, a multilingual thesaurus. By mapping English language expressions and their German equivalents to a common concept number, the thesaurus permitted the automatic processing of German language documents against English language queries and vice versa. The results of the evaluation, as it turned out, were somewhat inconclusive. However, this SMART experiment suggested in a bold and optimistic way how one might proceed to answer such complex questions as What is meant by retrieval language compatability? How it is to be achieved, and how evaluated?
    Footnote
    Original in: Journal of the American Society for Information Science 21(1970) no.3, S.187-194.
    Source
    Theory of subject analysis: a sourcebook. Ed.: L.M. Chan, et al
  6. Blank, I.; Rokach, L.; Shani, G.: Leveraging metadata to recommend keywords for academic papers (2016) 0.01
    0.011088221 = product of:
      0.027720552 = sum of:
        0.0117592495 = product of:
          0.058796246 = sum of:
            0.058796246 = weight(_text_:problem in 3232) [ClassicSimilarity], result of:
              0.058796246 = score(doc=3232,freq=4.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.33160037 = fieldWeight in 3232, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3232)
          0.2 = coord(1/5)
        0.015961302 = weight(_text_:of in 3232) [ClassicSimilarity], result of:
          0.015961302 = score(doc=3232,freq=16.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.24433708 = fieldWeight in 3232, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3232)
      0.4 = coord(2/5)
    
    Abstract
    Users of research databases, such as CiteSeerX, Google Scholar, and Microsoft Academic, often search for papers using a set of keywords. Unfortunately, many authors avoid listing sufficient keywords for their papers. As such, these applications may need to automatically associate good descriptive keywords with papers. When the full text of the paper is available this problem has been thoroughly studied. In many cases, however, due to copyright limitations, research databases do not have access to the full text. On the other hand, such databases typically maintain metadata, such as the title and abstract and the citation network of each paper. In this paper we study the problem of predicting which keywords are appropriate for a research paper, using different methods based on the citation network and available metadata. Our main goal is in providing search engines with the ability to extract keywords from the available metadata. However, our system can also be used for other applications, such as for recommending keywords for the authors of new papers. We create a data set of research papers, and their citation network, keywords, and other metadata, containing over 470K papers with and more than 2 million keywords. We compare our methods with predicting keywords using the title and abstract, in offline experiments and in a user study, concluding that the citation network provides much better predictions.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.12, S.3073-3091
  7. Martins, A.L.; Souza, R.R.; Ribeiro de Mello, H.: ¬The use of noun phrases in information retrieval : proposing a mechanism for automatic classification (2014) 0.01
    0.011038837 = product of:
      0.027597092 = sum of:
        0.016277399 = weight(_text_:of in 1441) [ClassicSimilarity], result of:
          0.016277399 = score(doc=1441,freq=26.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.2491759 = fieldWeight in 1441, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.03125 = fieldNorm(doc=1441)
        0.011319693 = product of:
          0.022639386 = sum of:
            0.022639386 = weight(_text_:22 in 1441) [ClassicSimilarity], result of:
              0.022639386 = score(doc=1441,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.15476047 = fieldWeight in 1441, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03125 = fieldNorm(doc=1441)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    This paper presents a research on syntactic structures known as noun phrases (NP) being applied to increase the effectiveness and efficiency of the mechanisms for the document's classification. Our hypothesis is the fact that the NP can be used instead of single words as a semantic aggregator to reduce the number of words that will be used for the classification system without losing its semantic coverage, increasing its efficiency. The experiment divided the documents classification process in three phases: a) NP preprocessing b) system training; and c) classification experiments. In the first step, a corpus of digitalized texts was submitted to a natural language processing platform1 in which the part-of-speech tagging was done, and them PERL scripts pertaining to the PALAVRAS package were used to extract the Noun Phrases. The preprocessing also involved the tasks of a) removing NP low meaning pre-modifiers, as quantifiers; b) identification of synonyms and corresponding substitution for common hyperonyms; and c) stemming of the relevant words contained in the NP, for similitude checking with other NPs. The first tests with the resulting documents have demonstrated its effectiveness. We have compared the structural similarity of the documents before and after the whole pre-processing steps of phase one. The texts maintained the consistency with the original and have kept the readability. The second phase involves submitting the modified documents to a SVM algorithm to identify clusters and classify the documents. The classification rules are to be established using a machine learning approach. Finally, tests will be conducted to check the effectiveness of the whole process.
    Source
    Knowledge organization in the 21st century: between historical patterns and future prospects. Proceedings of the Thirteenth International ISKO Conference 19-22 May 2014, Kraków, Poland. Ed.: Wieslaw Babik
  8. Mansour, N.; Haraty, R.A.; Daher, W.; Houri, M.: ¬An auto-indexing method for Arabic text (2008) 0.01
    0.010626211 = product of:
      0.026565526 = sum of:
        0.009978054 = product of:
          0.04989027 = sum of:
            0.04989027 = weight(_text_:problem in 2103) [ClassicSimilarity], result of:
              0.04989027 = score(doc=2103,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.28137225 = fieldWeight in 2103, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2103)
          0.2 = coord(1/5)
        0.016587472 = weight(_text_:of in 2103) [ClassicSimilarity], result of:
          0.016587472 = score(doc=2103,freq=12.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.25392252 = fieldWeight in 2103, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2103)
      0.4 = coord(2/5)
    
    Abstract
    This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. This method is mainly based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the container document. The weight is based on how spread is the word in a document and not only on its rate of occurrence. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples. For these examples, we obtained an average recall of 46% and an average precision of 64%.
  9. Wang, S.; Koopman, R.: Embed first, then predict (2019) 0.01
    0.010232857 = product of:
      0.025582144 = sum of:
        0.0117592495 = product of:
          0.058796246 = sum of:
            0.058796246 = weight(_text_:problem in 5400) [ClassicSimilarity], result of:
              0.058796246 = score(doc=5400,freq=4.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.33160037 = fieldWeight in 5400, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5400)
          0.2 = coord(1/5)
        0.013822895 = weight(_text_:of in 5400) [ClassicSimilarity], result of:
          0.013822895 = score(doc=5400,freq=12.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.21160212 = fieldWeight in 5400, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5400)
      0.4 = coord(2/5)
    
    Abstract
    Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. It is also desirable to be able to identify a small set of entities (e.g., authors, citations, bibliographic records) which are most relevant to a query. This gets more difficult when the amount of data increases dramatically. Data sparsity and model scalability are the major challenges to solving this type of extreme multilabel classification problem automatically. In this paper, we propose to address this problem in two steps: we first embed different types of entities into the same semantic space, where similarity could be computed easily; second, we propose a novel non-parametric method to identify the most relevant entities in addition to direct semantic similarities. We show how effectively this approach predicts even very specialised subjects, which are associated with few documents in the training set and are more problematic for a classifier.
    Footnote
    Beitrag eines Special Issue: Research Information Systems and Science Classifications; including papers from "Trajectories for Research: Fathoming the Promise of the NARCIS Classification," 27-28 September 2018, The Hague, The Netherlands.
  10. Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.01
    0.010048111 = product of:
      0.025120277 = sum of:
        0.009978054 = product of:
          0.04989027 = sum of:
            0.04989027 = weight(_text_:problem in 2910) [ClassicSimilarity], result of:
              0.04989027 = score(doc=2910,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.28137225 = fieldWeight in 2910, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2910)
          0.2 = coord(1/5)
        0.015142222 = weight(_text_:of in 2910) [ClassicSimilarity], result of:
          0.015142222 = score(doc=2910,freq=10.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.23179851 = fieldWeight in 2910, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2910)
      0.4 = coord(2/5)
    
    Abstract
    Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
  11. Greiner-Petter, A.; Schubotz, M.; Cohl, H.S.; Gipp, B.: Semantic preserving bijective mappings for expressions involving special functions between computer algebra systems and document preparation systems (2019) 0.01
    0.009945323 = product of:
      0.024863306 = sum of:
        0.013543614 = weight(_text_:of in 5499) [ClassicSimilarity], result of:
          0.013543614 = score(doc=5499,freq=18.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.20732687 = fieldWeight in 5499, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.03125 = fieldNorm(doc=5499)
        0.011319693 = product of:
          0.022639386 = sum of:
            0.022639386 = weight(_text_:22 in 5499) [ClassicSimilarity], result of:
              0.022639386 = score(doc=5499,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.15476047 = fieldWeight in 5499, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5499)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Purpose Modern mathematicians and scientists of math-related disciplines often use Document Preparation Systems (DPS) to write and Computer Algebra Systems (CAS) to calculate mathematical expressions. Usually, they translate the expressions manually between DPS and CAS. This process is time-consuming and error-prone. The purpose of this paper is to automate this translation. This paper uses Maple and Mathematica as the CAS, and LaTeX as the DPS. Design/methodology/approach Bruce Miller at the National Institute of Standards and Technology (NIST) developed a collection of special LaTeX macros that create links from mathematical symbols to their definitions in the NIST Digital Library of Mathematical Functions (DLMF). The authors are using these macros to perform rule-based translations between the formulae in the DLMF and CAS. Moreover, the authors develop software to ease the creation of new rules and to discover inconsistencies. Findings The authors created 396 mappings and translated 58.8 percent of DLMF formulae (2,405 expressions) successfully between Maple and DLMF. For a significant percentage, the special function definitions in Maple and the DLMF were different. An atomic symbol in one system maps to a composite expression in the other system. The translator was also successfully used for automatic verification of mathematical online compendia and CAS. The evaluation techniques discovered two errors in the DLMF and one defect in Maple. Originality/value This paper introduces the first translation tool for special functions between LaTeX and CAS. The approach improves error-prone manual translations and can be used to verify mathematical online compendia and CAS.
    Date
    20. 1.2015 18:30:22
    Source
    Aslib journal of information management. 71(2019) no.3, S.415-439
  12. Humphrey, S.M.; Névéol, A.; Browne, A.; Gobeil, J.; Ruch, P.; Darmoni, S.J.: Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty (2009) 0.01
    0.009710538 = product of:
      0.024276346 = sum of:
        0.008315044 = product of:
          0.041575223 = sum of:
            0.041575223 = weight(_text_:problem in 3300) [ClassicSimilarity], result of:
              0.041575223 = score(doc=3300,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.23447686 = fieldWeight in 3300, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3300)
          0.2 = coord(1/5)
        0.015961302 = weight(_text_:of in 3300) [ClassicSimilarity], result of:
          0.015961302 = score(doc=3300,freq=16.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.24433708 = fieldWeight in 3300, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3300)
      0.4 = coord(2/5)
    
    Abstract
    Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including, Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings (MeSH) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures performance is comparable, and for one measure JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule-based) might be combined and then evaluated showing they are complementary to one another.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.12, S.2530-2539
  13. Martins, E.F.; Belém, F.M.; Almeida, J.M.; Gonçalves, M.A.: On cold start for associative tag recommendation (2016) 0.01
    0.009710538 = product of:
      0.024276346 = sum of:
        0.008315044 = product of:
          0.041575223 = sum of:
            0.041575223 = weight(_text_:problem in 2494) [ClassicSimilarity], result of:
              0.041575223 = score(doc=2494,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.23447686 = fieldWeight in 2494, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2494)
          0.2 = coord(1/5)
        0.015961302 = weight(_text_:of in 2494) [ClassicSimilarity], result of:
          0.015961302 = score(doc=2494,freq=16.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.24433708 = fieldWeight in 2494, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2494)
      0.4 = coord(2/5)
    
    Abstract
    Tag recommendation strategies that exploit term co-occurrence patterns with tags previously assigned to the target object have consistently produced state-of-the-art results. However, such techniques work only for objects with previously assigned tags. Here we focus on tag recommendation for objects with no tags, a variation of the well-known \textit{cold start} problem. We start by evaluating state-of-the-art co-occurrence based methods in cold start. Our results show that the effectiveness of these methods suffers in this situation. Moreover, we show that employing various automatic filtering strategies to generate an initial tag set that enables the use of co-occurrence patterns produces only marginal improvements. We then propose a new approach that exploits both positive and negative user feedback to iteratively select input tags along with a genetic programming strategy to learn the recommendation function. Our experimental results indicate that extending the methods to include user relevance feedback leads to gains in precision of up to 58% over the best baseline in cold start scenarios and gains of up to 43% over the best baseline in objects that contain some initial tags (i.e., no cold start). We also show that our best relevance-feedback-driven strategy performs well even in scenarios that lack user cooperation (i.e., users may refuse to provide feedback) and user reliability (i.e., users may provide the wrong feedback).
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.1, S.83-105
  14. Needham, R.M.; Sparck Jones, K.: Keywords and clumps (1985) 0.01
    0.009394583 = product of:
      0.023486458 = sum of:
        0.005820531 = product of:
          0.029102655 = sum of:
            0.029102655 = weight(_text_:problem in 3645) [ClassicSimilarity], result of:
              0.029102655 = score(doc=3645,freq=2.0), product of:
                0.17731056 = queryWeight, product of:
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.04177434 = queryNorm
                0.1641338 = fieldWeight in 3645, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  4.244485 = idf(docFreq=1723, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=3645)
          0.2 = coord(1/5)
        0.017665926 = weight(_text_:of in 3645) [ClassicSimilarity], result of:
          0.017665926 = score(doc=3645,freq=40.0), product of:
            0.06532493 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.04177434 = queryNorm
            0.2704316 = fieldWeight in 3645, product of:
              6.3245554 = tf(freq=40.0), with freq of:
                40.0 = termFreq=40.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3645)
      0.4 = coord(2/5)
    
    Abstract
    The selection that follows was chosen as it represents "a very early paper an the possibilities allowed by computers an documentation." In the early 1960s computers were being used to provide simple automatic indexing systems wherein keywords were extracted from documents. The problem with such systems was that they lacked vocabulary control, thus documents related in subject matter were not always collocated in retrieval. To improve retrieval by improving recall is the raison d'être of vocabulary control tools such as classifications and thesauri. The question arose whether it was possible by automatic means to construct classes of terms, which when substituted, one for another, could be used to improve retrieval performance? One of the first theoretical approaches to this question was initiated by R. M. Needham and Karen Sparck Jones at the Cambridge Language Research Institute in England.t The question was later pursued using experimental methodologies by Sparck Jones, who, as a Senior Research Associate in the Computer Laboratory at the University of Cambridge, has devoted her life's work to research in information retrieval and automatic naturai language processing. Based an the principles of numerical taxonomy, automatic classification techniques start from the premise that two objects are similar to the degree that they share attributes in common. When these two objects are keywords, their similarity is measured in terms of the number of documents they index in common. Step 1 in automatic classification is to compute mathematically the degree to which two terms are similar. Step 2 is to group together those terms that are "most similar" to each other, forming equivalence classes of intersubstitutable terms. The technique for forming such classes varies and is the factor that characteristically distinguishes different approaches to automatic classification. The technique used by Needham and Sparck Jones, that of clumping, is described in the selection that follows. Questions that must be asked are whether the use of automatically generated classes really does improve retrieval performance and whether there is a true eco nomic advantage in substituting mechanical for manual labor. Several years after her work with clumping, Sparck Jones was to observe that while it was not wholly satisfactory in itself, it was valuable in that it stimulated research into automatic classification. To this it might be added that it was valuable in that it introduced to libraryl information science the methods of numerical taxonomy, thus stimulating us to think again about the fundamental nature and purpose of classification. In this connection it might be useful to review how automatically derived classes differ from those of manually constructed classifications: 1) the manner of their derivation is purely a posteriori, the ultimate operationalization of the principle of literary warrant; 2) the relationship between members forming such classes is essentially statistical; the members of a given class are similar to each other not because they possess the class-defining characteristic but by virtue of sharing a family resemblance; and finally, 3) automatically derived classes are not related meaningfully one to another, that is, they are not ordered in traditional hierarchical and precedence relationships.
    Footnote
    Original in: Journal of documentation 20(1964) no.1, S.5-15.
    Source
    Theory of subject analysis: a sourcebook. Ed.: L.M. Chan, et al
  15. Voorhees, E.M.: Implementing agglomerative hierarchic clustering algorithms for use in document retrieval (1986) 0.01
    0.009055755 = product of:
      0.045278773 = sum of:
        0.045278773 = product of:
          0.090557545 = sum of:
            0.090557545 = weight(_text_:22 in 402) [ClassicSimilarity], result of:
              0.090557545 = score(doc=402,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.61904186 = fieldWeight in 402, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.125 = fieldNorm(doc=402)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Source
    Information processing and management. 22(1986) no.6, S.465-476
  16. Fuhr, N.; Niewelt, B.: ¬Ein Retrievaltest mit automatisch indexierten Dokumenten (1984) 0.01
    0.007923785 = product of:
      0.039618924 = sum of:
        0.039618924 = product of:
          0.07923785 = sum of:
            0.07923785 = weight(_text_:22 in 262) [ClassicSimilarity], result of:
              0.07923785 = score(doc=262,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.5416616 = fieldWeight in 262, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=262)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Date
    20.10.2000 12:22:23
  17. Hlava, M.M.K.: Automatic indexing : comparing rule-based and statistics-based indexing systems (2005) 0.01
    0.007923785 = product of:
      0.039618924 = sum of:
        0.039618924 = product of:
          0.07923785 = sum of:
            0.07923785 = weight(_text_:22 in 6265) [ClassicSimilarity], result of:
              0.07923785 = score(doc=6265,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.5416616 = fieldWeight in 6265, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6265)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Source
    Information outlook. 9(2005) no.8, S.22-23
  18. Fuhr, N.: Ranking-Experimente mit gewichteter Indexierung (1986) 0.01
    0.006791815 = product of:
      0.033959076 = sum of:
        0.033959076 = product of:
          0.06791815 = sum of:
            0.06791815 = weight(_text_:22 in 58) [ClassicSimilarity], result of:
              0.06791815 = score(doc=58,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.46428138 = fieldWeight in 58, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=58)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Date
    14. 6.2015 22:12:44
  19. Hauer, M.: Automatische Indexierung (2000) 0.01
    0.006791815 = product of:
      0.033959076 = sum of:
        0.033959076 = product of:
          0.06791815 = sum of:
            0.06791815 = weight(_text_:22 in 5887) [ClassicSimilarity], result of:
              0.06791815 = score(doc=5887,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.46428138 = fieldWeight in 5887, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=5887)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Source
    Wissen in Aktion: Wege des Knowledge Managements. 22. Online-Tagung der DGI, Frankfurt am Main, 2.-4.5.2000. Proceedings. Hrsg.: R. Schmidt
  20. Fuhr, N.: Rankingexperimente mit gewichteter Indexierung (1986) 0.01
    0.006791815 = product of:
      0.033959076 = sum of:
        0.033959076 = product of:
          0.06791815 = sum of:
            0.06791815 = weight(_text_:22 in 2051) [ClassicSimilarity], result of:
              0.06791815 = score(doc=2051,freq=2.0), product of:
                0.14628662 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.04177434 = queryNorm
                0.46428138 = fieldWeight in 2051, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=2051)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Date
    14. 6.2015 22:12:56

Languages