Search (547 results, page 1 of 28)

  • theme_ss:"Computerlinguistik"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.31
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
    Content
     Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
    Source
    Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK
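     The relevance value after each record (e.g. 0.31 for record 1) is a Lucene ClassicSimilarity (tf-idf) score. The sketch below is only an illustration of that arithmetic, not the catalog's actual code; the helper names and the index statistics passed in (docFreq, maxDocs, queryNorm, fieldNorm) are example values of the kind such an index reports.
```python
import math

# Minimal sketch of a single term's contribution to a Lucene ClassicSimilarity
# score. Helper names and example statistics are ours, not the catalog's.

def idf(doc_freq: int, max_docs: int) -> float:
    # ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_contribution(freq: float, doc_freq: int, max_docs: int,
                      query_norm: float, field_norm: float) -> float:
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm                      # query-side weight
    field_weight = math.sqrt(freq) * term_idf * field_norm    # tf(freq) = sqrt(freq)
    return query_weight * field_weight

# Example: a rare term (docFreq=24 in an index of 44,218 documents) occurring
# twice in a short field contributes roughly 0.165 before coordination.
print(term_contribution(freq=2.0, doc_freq=24, max_docs=44218,
                        query_norm=0.034531306, field_norm=0.046875))
```
     The per-term contributions are then summed and multiplied by a coordination factor that rewards documents matching more of the query terms.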
  2. Huo, W.: Automatic multi-word term extraction and its application to Web-page summarization (2012) 0.17
    Abstract
    In this thesis we propose three new word association measures for multi-word term extraction. We combine these association measures with LocalMaxs algorithm in our extraction model and compare the results of different multi-word term extraction methods. Our approach is language and domain independent and requires no training data. It can be applied to such tasks as text summarization, information retrieval, and document classification. We further explore the potential of using multi-word terms as an effective representation for general web-page summarization. We extract multi-word terms from human written summaries in a large collection of web-pages, and generate the summaries by aligning document words with these multi-word terms. Our system applies machine translation technology to learn the aligning process from a training set and focuses on selecting high quality multi-word terms from human written summaries to generate suitable results for web-page summarization.
    Content
     A thesis presented to The University of Guelph in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. Cf.: http://www.inf.ufrgs.br/~ceramisch/download_files/publications/2009/p01.pdf.
    Date
    10. 1.2013 19:22:47
    Imprint
    Guelph, Ontario : University of Guelph
  3. Noever, D.; Ciolino, M.: ¬The Turing deception (2022) 0.14
    Abstract
     This research revisits the classic Turing test and compares recent large language models such as ChatGPT for their abilities to reproduce human-level comprehension and compelling text generation. Two task challenges - summary and question answering - prompt ChatGPT to produce original content (98-99%) from a single text entry and sequential questions initially posed by Turing in 1950. We score the original and generated content against the OpenAI GPT-2 Output Detector from 2019, and establish multiple cases where the generated content proves original and undetectable (98%). The question of a machine fooling a human judge recedes in this work relative to the question of "how would one prove it?" The original contribution of the work presents a metric and simple grammatical set for understanding the writing mechanics of chatbots in evaluating their readability and statistical clarity, engagement, delivery, overall quality, and plagiarism risks. While Turing's original prose scores at least 14% below the machine-generated output, whether an algorithm displays hints of Turing's true initial thoughts (the "Lovelace 2.0" test) remains unanswerable.
    Source
     https://arxiv.org/abs/2212.06721
  4. Lu, K.; Cai, X.; Ajiferuke, I.; Wolfram, D.: Vocabulary size and its effect on topic representation (2017) 0.06
    Abstract
    This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
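     As a rough illustration of the vocabulary-pruning step described in the abstract (dropping singly occurring terms and the most frequent terms before training a topic model), the sketch below removes the top 0.5% most frequent terms and all singletons from a tokenized corpus. The corpus, thresholds, and function names are illustrative assumptions, not taken from the study.
```python
from collections import Counter

def prune_vocabulary(docs, top_fraction=0.005, min_count=2):
    """docs: list of token lists. Returns docs with pruned vocabulary."""
    freq = Counter(tok for doc in docs for tok in doc)
    ranked = [t for t, _ in freq.most_common()]            # most frequent first
    n_top = int(len(ranked) * top_fraction)
    drop = set(ranked[:n_top])                             # top 0.5% most frequent terms
    drop |= {t for t, c in freq.items() if c < min_count}  # singly occurring terms
    return [[t for t in doc if t not in drop] for doc in docs]

# Toy corpus: only "topic" and "vocabulary" occur more than once and survive.
docs = [["topic", "model", "vocabulary", "term"],
        ["vocabulary", "size", "affects", "topic", "representation"]]
print(prune_vocabulary(docs))
```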
  5. Xinglin, L.: Automatic summarization method based on compound word recognition (2015) 0.06
    Abstract
     After analyzing the main methods of automatic summarization in use today, we find that they all ignore the weight of unknown words in the sentence. To overcome this problem, a method for automatic summarization based on compound word recognition is proposed. With this method, compound words in the text are first identified and the word segmentation is corrected. Then, a keyword set is extracted from the Chinese documents and sentence weights are calculated from the weights of the keyword set. Because the weight of compound words is calculated by a different weight calculation formula, the corresponding total weight of each sentence can be determined. Finally, sentences with higher weights are selected by percentage and output in their original order to make up the summary. Experiments were conducted on the HIT IR-Lab Text Summarization Corpus; the results show that the proposed method achieves a precision of 76.51%, from which we conclude that the method is applicable to automatic summarization and performs well.
    Source
     Journal of computational information systems (JCIS). 11(2015), no.6, S.2257-2268
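     A hedged sketch of the sentence-selection step outlined above: each sentence is scored by the weights of the keywords it contains, a fixed share of the highest-weighted sentences is kept, and the selected sentences are output in their original order. The keyword weights and the selection ratio below are invented for illustration and are not the formulas used in the paper.
```python
def summarize(sentences, keyword_weights, ratio=0.3):
    """Select the highest-weighted sentences and return them in original order."""
    scored = []
    for idx, sent in enumerate(sentences):
        weight = sum(keyword_weights.get(tok, 0.0) for tok in sent.split())
        scored.append((weight, idx, sent))
    keep = max(1, int(len(sentences) * ratio))
    top = sorted(scored, reverse=True)[:keep]                 # highest-weight sentences
    return [sent for _, idx, sent in sorted(top, key=lambda t: t[1])]  # original order

sents = ["Compound words carry topical weight.",
         "This sentence is filler.",
         "Keyword weights determine the sentence score."]
weights = {"Compound": 2.0, "Keyword": 1.5, "weights": 1.0, "weight.": 0.8}
print(summarize(sents, weights, ratio=0.34))
```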
  6. Weingarten, R.: ¬Die Verkabelung der Sprache : Grenzen der Technisierung von Kommunikation (1989) 0.05
    LCSH
    Communication / Technological innovations
    Subject
    Communication / Technological innovations
  7. Salton, G.: Automatic processing of foreign language documents (1985) 0.05
    Abstract
     The attempt to computerize a process, such as indexing, abstracting, classifying, or retrieving information, begins with an analysis of the process into its intellectual and nonintellectual components. That part of the process which is amenable to computerization is mechanical or algorithmic. What is not is intellectual or creative and requires human intervention. Gerard Salton has been an innovator, experimenter, and promoter in the area of mechanized information systems since the early 1960s. He has been particularly ingenious at analyzing the process of information retrieval into its algorithmic components. He received a doctorate in applied mathematics from Harvard University before moving to the computer science department at Cornell, where he developed a prototype automatic retrieval system called SMART. Working with this system he and his students contributed for over a decade to our theoretical understanding of the retrieval process. On a more practical level, they have contributed design criteria for operating retrieval systems. The following selection presents one of the early descriptions of the SMART system; it is valuable as it shows the direction automatic retrieval methods were to take beyond simple word-matching techniques. These include various word normalization techniques to improve recall, for instance, the separation of words into stems and affixes; the correlation and clustering, using statistical association measures, of related terms; and the identification, using a concept thesaurus, of synonymous, broader, narrower, and sibling terms. They include, as well, techniques, both linguistic and statistical, to deal with the thorny problem of how to automatically extract from texts index terms that consist of more than one word. They include weighting techniques and various document-request matching algorithms. Significant among the latter are those which produce a retrieval output of citations ranked in relevance order. During the 1970s, Salton and his students went on to further refine these various techniques, particularly the weighting and statistical association measures. Many of their early innovations seem commonplace today. Some of their later techniques are still ahead of their time and await technological developments for implementation. The particular focus of the selection that follows is on the evaluation of a particular component of the SMART system, a multilingual thesaurus. By mapping English language expressions and their German equivalents to a common concept number, the thesaurus permitted the automatic processing of German language documents against English language queries and vice versa. The results of the evaluation, as it turned out, were somewhat inconclusive. However, this SMART experiment suggested in a bold and optimistic way how one might proceed to answer such complex questions as: What is meant by retrieval language compatibility? How is it to be achieved, and how evaluated?
    Footnote
    Original in: Journal of the American Society for Information Science 21(1970) no.3, S.187-194.
    Source
    Theory of subject analysis: a sourcebook. Ed.: L.M. Chan, et al
  8. Azpiazu, I.M.; Soledad Pera, M.: Is cross-lingual readability assessment possible? (2020) 0.04
    Abstract
     Most research efforts related to automatic readability assessment focus on the design of strategies that apply to a specific language. These state-of-the-art strategies are highly dependent on linguistic features that best suit the language for which they were intended, constraining their adaptability and making it difficult to determine whether they would remain effective if they were applied to estimate the level of difficulty of texts in other languages. In this article, we present the results of a study designed to determine the feasibility of a cross-lingual readability assessment strategy. To do so, we first analyzed the most common features used for readability assessment and determined their influence on the readability prediction process of 6 different languages: English, Spanish, Basque, Italian, French, and Catalan. In addition, we developed a cross-lingual readability assessment strategy that serves as a means to empirically explore the potential advantages of employing a single strategy (and set of features) for readability assessment in different languages, including interlanguage prediction agreement and prediction accuracy improvement for low-resource languages.
    Source
    Journal of the Association for Information Science and Technology. 71(2020) no.6, S.644-656
  9. Andrushchenko, M.; Sandberg, K.; Turunen, R.; Marjanen, J.; Hatavara, M.; Kurunmäki, J.; Nummenmaa, T.; Hyvärinen, M.; Teräs, K.; Peltonen, J.; Nummenmaa, J.: Using parsed and annotated corpora to analyze parliamentarians' talk in Finland (2022) 0.03
    Abstract
    We present a search system for grammatically analyzed corpora of Finnish parliamentary records and interviews with former parliamentarians, annotated with metadata of talk structure and involved parliamentarians, and discuss their use through carefully chosen digital humanities case studies. We first introduce the construction, contents, and principles of use of the corpora. Then we discuss the application of the search system and the corpora to study how politicians talk about power, how ideological terms are used in political speech, and how to identify narratives in the data. All case studies stem from questions in the humanities and the social sciences, but rely on the grammatically parsed corpora in both identifying and quantifying passages of interest. Finally, the paper discusses the role of natural language processing methods for questions in the (digital) humanities. It makes the claim that a digital humanities inquiry of parliamentary speech and interviews with politicians cannot only rely on computational humanities modeling, but needs to accommodate a range of perspectives starting with simple searches, quantitative exploration, and ending with modeling. Furthermore, the digital humanities need a more thorough discussion about how the utilization of tools from information science and technologies alter the research questions posed in the humanities.
    Series
    JASIST special issue on digital humanities (DH): C. Methodological innovations, challenges, and new interest in DH
    Source
    Journal of the Association for Information Science and Technology. 73(2022) no.2, S.288-302
  10. Melucci, M.; Orio, N.: Design, implementation, and evaluation of a methodology for automatic stemmer generation (2007) 0.03
    Abstract
    The authors describe a statistical approach based on hidden Markov models (HMMs), for generating stemmers automatically. The proposed approach requires little effort to insert new languages in the system even if minimal linguistic knowledge is available. This is a key advantage especially for digital libraries, which are often developed for a specific institution or government because the program can manage a great amount of documents written in local languages. The evaluation described in the article shows that the stemmers implemented by means of HMMs are as effective as those based on linguistic rules.
    Source
    Journal of the American Society for Information Science and Technology. 58(2007) no.5, S.673-686
  11. Suissa, O.; Elmalech, A.; Zhitomirsky-Geffet, M.: Text analysis using deep neural networks in digital humanities and information science (2022) 0.03
    Abstract
    Combining computational technologies and humanities is an ongoing effort aimed at making resources such as texts, images, audio, video, and other artifacts digitally available, searchable, and analyzable. In recent years, deep neural networks (DNN) dominate the field of automatic text analysis and natural language processing (NLP), in some cases presenting a super-human performance. DNNs are the state-of-the-art machine learning algorithms solving many NLP tasks that are relevant for Digital Humanities (DH) research, such as spell checking, language detection, entity extraction, author detection, question answering, and other tasks. These supervised algorithms learn patterns from a large number of "right" and "wrong" examples and apply them to new examples. However, using DNNs for analyzing the text resources in DH research presents two main challenges: (un)availability of training data and a need for domain adaptation. This paper explores these challenges by analyzing multiple use-cases of DH studies in recent literature and their possible solutions and lays out a practical decision model for DH experts for when and how to choose the appropriate deep learning approaches for their research. Moreover, in this paper, we aim to raise awareness of the benefits of utilizing deep learning models in the DH community.
    Series
    JASIST special issue on digital humanities (DH): C. Methodological innovations, challenges, and new interest in DH
    Source
    Journal of the Association for Information Science and Technology. 73(2022) no.2, S.268-287
  12. Carrillo-de-Albornoz, J.; Plaza, L.: ¬An emotion-based model of negation, intensifiers, and modality for polarity and intensity classification (2013) 0.03
    Abstract
    Negation, intensifiers, and modality are common linguistic constructions that may modify the emotional meaning of the text and therefore need to be taken into consideration in sentiment analysis. Negation is usually considered as a polarity shifter, whereas intensifiers are regarded as amplifiers or diminishers of the strength of such polarity. Modality, in turn, has only been addressed in a very naïve fashion, so that modal forms are treated as polarity blockers. However, processing these constructions as mere polarity modifiers may be adequate for polarity classification, but it is not enough for more complex tasks (e.g., intensity classification), for which a more fine-grained model based on emotions is needed. In this work, we study the effect of modifiers on the emotions affected by them and propose a model of negation, intensifiers, and modality especially conceived for sentiment analysis tasks. We compare our emotion-based strategy with two traditional approaches based on polar expressions and find that representing the text as a set of emotions increases accuracy in different classification tasks and that this representation allows for a more accurate modeling of modifiers that results in further classification improvements. We also study the most common uses of modifiers in opinionated texts and quantify their impact in polarity and intensity classification. Finally, we analyze the joint effect of emotional modifiers and find that interesting synergies exist between them.
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.8, S.1618-1633
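     As a toy illustration of the modifier treatment discussed in the abstract (negation as a polarity shifter, intensifiers as amplifiers or diminishers), the sketch below applies those two rules to a small hand-made lexicon. It deliberately ignores the emotion-based representation that is the paper's actual contribution; all lexicon values are invented.
```python
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}
POLARITY = {"good": 1.0, "bad": -1.0, "helpful": 0.8, "awful": -0.9}

def polarity(tokens):
    score, sign, scale = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in NEGATORS:
            sign = -sign                       # polarity shifter
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]         # amplifier / diminisher
        elif tok in POLARITY:
            score += sign * scale * POLARITY[tok]
            sign, scale = 1.0, 1.0             # modifiers apply to the next polar word
    return score

print(polarity("this is not very helpful".split()))   # -> -1.2 (negated, amplified)
```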
  13. Kokol, P.; Podgorelec, V.; Zorman, M.; Kokol, T.; Njivar, T.: Computer and natural language texts : a comparison based on long-range correlations (1999) 0.03
    Abstract
     'Long-range power law correlation' (LRC) is defined as the maximal propagation distance of the effect of some disturbance within a system, and is found in many systems that can be represented as strings of symbols. LRC between characters has also been identified in natural language texts. The aim of this article is to show that long-range power law correlations can also be found in computer programs, meaning that some common laws hold for both natural language texts and computer programs. This fact enables one to draw parallels between these 2 different types of human writings, and also enables one to measure the differences between them.
    Source
    Journal of the American Society for Information Science. 50(1999) no.14, S.1295-1301
  14. Peis, E.; Herrera-Viedma, E.; Herrera, J.C.: On the evaluation of XML documents using Fuzzy linguistic techniques (2003) 0.03
    Abstract
     Recommender systems evaluate and filter the great amount of information available on the Web to assist people in their search processes. A fuzzy evaluation method for XML documents based on computing with words is presented. Given an XML document type (e.g. a scientific article), we consider that its elements are not equally informative. This is indicated by the use of a DTD and by defining linguistic importance attributes for the more meaningful elements of the designed DTD. Then, the evaluation method generates linguistic recommendations from linguistic evaluation judgements provided by different recommenders on meaningful elements of the DTD.
    Source
    Challenges in knowledge representation and organization for the 21st century: Integration of knowledge across boundaries. Proceedings of the 7th ISKO International Conference Granada, Spain, July 10-13, 2002. Ed.: M. López-Huertas
  15. Airio, E.: Who benefits from CLIR in web retrieval? (2008) 0.03
    Abstract
     Purpose - The aim of the current paper is to test whether query translation is beneficial in web retrieval. Design/methodology/approach - The language pairs were Finnish-Swedish, English-German and Finnish-French. A total of 12-18 participants were recruited for each language pair. Each participant performed four retrieval tasks. The author's aim was to compare the performance of the translated queries with that of the target language queries. Thus, the author asked participants to formulate a source language query and a target language query for each task. The source language queries were translated into the target language utilizing a dictionary-based system. For English-German, machine translation was also utilized. The author used Google as the search engine. Findings - The results differed depending on the language pair. The author concluded that the dictionary coverage had an effect on the results. On average, the results of query translation were better than in the traditional laboratory tests. Originality/value - This research shows that query translation in web retrieval is beneficial, especially for users with moderate and non-active language skills. This is valuable information for developers of cross-language information retrieval systems.
    Source
    Journal of documentation. 64(2008) no.5, S.760-778
  16. Ahmed, F.; Nürnberger, A.: Evaluation of n-gram conflation approaches for Arabic text retrieval (2009) 0.03
    Abstract
    In this paper we present a language-independent approach for conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method is based on an enhancement of the pure n-gram model that can group related words based on various string-similarity measures, while restricting the search to specific locations of the target word by taking into account the order of n-grams. We show that the method is effective to achieve high score similarities for all word-form variations and reduces the ambiguity, i.e., obtains a higher precision and recall, compared to pure n-gram-based approaches for English, Portuguese, and Arabic. The proposed method is especially suited for conflation approaches in Arabic, since Arabic is a highly inflectional language. Therefore, we present in addition an adaptive user interface for Arabic text retrieval called araSearch. araSearch serves as a metasearch interface to existing search engines. The system is able to extend a query using the proposed conflation approach such that additional results for relevant subwords can be found automatically.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.7, S.1448-1465
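     A minimal sketch of n-gram-based conflation in the spirit of the abstract above: word forms are grouped when their character-bigram overlap (Dice coefficient) exceeds a threshold. The authors' model additionally restricts matches by n-gram position within the word; that refinement, as well as the threshold and word list used here, is an assumption made only for illustration.
```python
def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a, b, n=2):
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def conflate(words, threshold=0.5):
    groups = []
    for w in words:
        for g in groups:
            if dice(w, g[0]) >= threshold:   # compare against the group representative
                g.append(w)
                break
        else:
            groups.append([w])
    return groups

# Related word forms end up in the same group.
print(conflate(["retrieval", "retrieve", "retrieving", "conflation", "conflate"]))
```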
  17. Moohebat, M.; Raj, R.G.; Kareem, S.B.A.; Thorleuchter, D.: Identifying ISI-indexed articles by their lexical usage : a text analysis approach (2015) 0.03
    0.025235984 = product of:
      0.09253194 = sum of:
        0.06316024 = weight(_text_:higher in 1664) [ClassicSimilarity], result of:
          0.06316024 = score(doc=1664,freq=2.0), product of:
            0.18138453 = queryWeight, product of:
              5.252756 = idf(docFreq=628, maxDocs=44218)
              0.034531306 = queryNorm
            0.34821182 = fieldWeight in 1664, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.252756 = idf(docFreq=628, maxDocs=44218)
              0.046875 = fieldNorm(doc=1664)
        0.013711456 = weight(_text_:of in 1664) [ClassicSimilarity], result of:
          0.013711456 = score(doc=1664,freq=12.0), product of:
            0.053998582 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.034531306 = queryNorm
            0.25392252 = fieldWeight in 1664, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=1664)
        0.015660247 = weight(_text_:on in 1664) [ClassicSimilarity], result of:
          0.015660247 = score(doc=1664,freq=4.0), product of:
            0.07594867 = queryWeight, product of:
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.034531306 = queryNorm
            0.20619515 = fieldWeight in 1664, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.046875 = fieldNorm(doc=1664)
      0.27272728 = coord(3/11)
    
    Abstract
    This research creates an architecture for investigating whether probable lexical divergences exist between articles categorized as Institute for Scientific Information (ISI) indexed and non-ISI and, if such a difference is discovered, for proposing the best available classification method. Based on a collection of ISI- and non-ISI-indexed articles in the areas of business and computer science, three classification models are trained. A sensitivity analysis is applied to demonstrate the impact of words in different syntactical forms on the classification decision. The results demonstrate that the lexical domains of ISI and non-ISI articles are distinguishable by machine learning techniques. Our findings indicate that the support vector machine identifies ISI-indexed articles in both disciplines with higher precision than do the Naïve Bayesian and K-Nearest Neighbors techniques.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.3, S.501-511
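    A minimal sketch of the comparison described above, bag-of-words features with a linear SVM against Naïve Bayes and k-NN, is shown below, assuming scikit-learn and a tiny placeholder corpus in place of the ISI/non-ISI article collection used in the study.

      # Minimal sketch of the three-way classifier comparison: TF-IDF
      # features fed to a linear SVM, Naive Bayes and k-NN. The in-line
      # documents and labels are placeholder assumptions.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.pipeline import make_pipeline

      docs = [
          "we propose a novel framework and evaluate it on benchmark data",
          "results are statistically significant across three test collections",
          "this blog post lists ten tips for faster writing",
          "a quick overview of conference deadlines this year",
      ]
      labels = ["isi", "isi", "non-isi", "non-isi"]  # placeholder labels

      for name, clf in [("SVM", LinearSVC()),
                        ("NaiveBayes", MultinomialNB()),
                        ("kNN", KNeighborsClassifier(n_neighbors=1))]:
          model = make_pipeline(TfidfVectorizer(), clf)
          model.fit(docs, labels)
          pred = model.predict(["we evaluate the proposed method on test collections"])
          print(name, "->", pred[0])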
  18. Herrera-Viedma, E.: Modeling the retrieval process for an information retrieval system using an ordinal fuzzy linguistic approach (2001) 0.02
    0.024783067 = product of:
      0.090871245 = sum of:
        0.053516448 = weight(_text_:effect in 5752) [ClassicSimilarity], result of:
          0.053516448 = score(doc=5752,freq=2.0), product of:
            0.18289955 = queryWeight, product of:
              5.29663 = idf(docFreq=601, maxDocs=44218)
              0.034531306 = queryNorm
            0.2926002 = fieldWeight in 5752, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.29663 = idf(docFreq=601, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5752)
        0.014751178 = weight(_text_:of in 5752) [ClassicSimilarity], result of:
          0.014751178 = score(doc=5752,freq=20.0), product of:
            0.053998582 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.034531306 = queryNorm
            0.27317715 = fieldWeight in 5752, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5752)
        0.02260362 = weight(_text_:on in 5752) [ClassicSimilarity], result of:
          0.02260362 = score(doc=5752,freq=12.0), product of:
            0.07594867 = queryWeight, product of:
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.034531306 = queryNorm
            0.29761705 = fieldWeight in 5752, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5752)
      0.27272728 = coord(3/11)
    
    Abstract
    A linguistic model for an Information Retrieval System (IRS) defined using an ordinal fuzzy linguistic approach is proposed. The ordinal fuzzy linguistic approach is presented, and its use for modeling the imprecision and subjectivity that appear in the user-IRS interaction is studied. The user queries and IRS responses are modeled linguistically using the concept of fuzzy linguistic variables. The system accepts Boolean queries whose terms can be weighted simultaneously by means of ordinal linguistic values according to three possible semantics: a symmetrical threshold semantic, a quantitative semantic, and an importance semantic. The first one identifies a new threshold semantic used to express qualitative restrictions on the documents retrieved for a given term. It is monotone increasing in index term weight for threshold values on the right of the mid-value, and decreasing for threshold values on the left of the mid-value. The second one is a new semantic proposal introduced to express quantitative restrictions on the documents retrieved for a term, i.e., restrictions on the number of documents that must be retrieved containing that term. The last one is the usual semantic of relative importance, which has an effect when the term is in a Boolean expression. A bottom-up evaluation mechanism for queries is presented that coherently integrates the use of the three semantics and satisfies the separability property. The advantage of this IRS with respect to others is that users can express linguistically different semantic restrictions on the desired documents simultaneously, incorporating more flexibility into the user-IRS interaction.
    Source
    Journal of the American Society for Information Science and technology. 52(2001) no.6, S.460-475
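    The symmetrical threshold semantics described above can be illustrated with a small sketch on an ordinal label scale s0..s8 with mid-value s4. The scale, the crisp yes/no matching rule and the label names are illustrative assumptions; the paper itself works with fuzzy linguistic evaluation functions rather than a hard test.

      # Minimal sketch of a symmetrical threshold semantics on an ordinal
      # linguistic scale. A threshold above the mid-value asks for "at least
      # this much" (presence); a threshold below it asks for "at most this
      # much" (near absence). This crisp rule is a simplification.

      SCALE = [f"s{i}" for i in range(9)]   # s0 (none) ... s8 (total)
      MID = 4                               # mid-value s4

      def rank(label: str) -> int:
          return SCALE.index(label)

      def threshold_match(doc_weight: str, threshold: str) -> bool:
          """Monotone increasing branch right of the mid-value,
          monotone decreasing branch left of it."""
          w, t = rank(doc_weight), rank(threshold)
          if t >= MID:
              return w >= t
          return w <= t

      if __name__ == "__main__":
          # Query term weighted s6: only strongly indexed documents match.
          print(threshold_match("s7", "s6"))   # True
          print(threshold_match("s3", "s6"))   # False
          # Query term weighted s2: only weakly indexed documents match.
          print(threshold_match("s1", "s2"))   # True
          print(threshold_match("s7", "s2"))   # False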
  19. Kim, S.; Ko, Y.; Oard, D.W.: Combining lexical and statistical translation evidence for cross-language information retrieval (2015) 0.02
    0.024429668 = product of:
      0.08957545 = sum of:
        0.064219736 = weight(_text_:effect in 1606) [ClassicSimilarity], result of:
          0.064219736 = score(doc=1606,freq=2.0), product of:
            0.18289955 = queryWeight, product of:
              5.29663 = idf(docFreq=601, maxDocs=44218)
              0.034531306 = queryNorm
            0.35112026 = fieldWeight in 1606, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.29663 = idf(docFreq=601, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
        0.009695465 = weight(_text_:of in 1606) [ClassicSimilarity], result of:
          0.009695465 = score(doc=1606,freq=6.0), product of:
            0.053998582 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.034531306 = queryNorm
            0.17955035 = fieldWeight in 1606, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
        0.015660247 = weight(_text_:on in 1606) [ClassicSimilarity], result of:
          0.015660247 = score(doc=1606,freq=4.0), product of:
            0.07594867 = queryWeight, product of:
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.034531306 = queryNorm
            0.20619515 = fieldWeight in 1606, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
      0.27272728 = coord(3/11)
    
    Abstract
    This article explores how best to use lexical and statistical translation evidence together for cross-language information retrieval (CLIR). Lexical translation evidence is assembled from Wikipedia and from a large machine-readable dictionary, statistical translation evidence is drawn from parallel corpora, and evidence from co-occurrence in the document language provides a basis for limiting the adverse effect of translation ambiguity. Coverage statistics for NII Testbeds and Community for Information Access Research (NTCIR) queries confirm that these resources have complementary strengths. Experiments with translation evidence from a small parallel corpus indicate that even rather rough estimates of translation probabilities can yield further improvements over a strong technique for translation weighting based on using Jensen-Shannon divergence as a term-association measure. Finally, a novel approach to posttranslation query expansion using a random walk over the Wikipedia concept link graph is shown to yield further improvements over alternative techniques for posttranslation query expansion. Evaluation results on the NTCIR-5 English-Korean test collection show statistically significant improvements over strong baselines.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.1, S.23-39
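    The Jensen-Shannon divergence mentioned above as a term-association measure for translation weighting can be computed as in the following sketch. The toy co-occurrence distributions and the conversion of divergence into a similarity-style score are illustrative assumptions.

      # Minimal sketch of Jensen-Shannon divergence as a term-association
      # measure: candidate translations whose co-occurrence profile is
      # close to the source term's profile receive a higher score.

      import math

      def kl_divergence(p, q):
          return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

      def jensen_shannon(p, q):
          """Symmetric divergence, bounded in [0, 1] with base-2 logs."""
          m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
          return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

      def association(p, q):
          """Turn divergence into a similarity-style association score."""
          return 1.0 - jensen_shannon(p, q)

      if __name__ == "__main__":
          # Co-occurrence profiles over a shared context vocabulary (toy numbers).
          source      = [0.50, 0.30, 0.15, 0.05]
          candidate_a = [0.45, 0.35, 0.15, 0.05]   # similar profile -> high association
          candidate_b = [0.05, 0.10, 0.25, 0.60]   # dissimilar profile -> low association
          print(round(association(source, candidate_a), 3))
          print(round(association(source, candidate_b), 3))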
  20. Ali, C.B.; Haddad, H.; Slimani, Y.: Multi-word terms selection for information retrieval (2022) 0.02
    0.02432607 = product of:
      0.08919559 = sum of:
        0.008079554 = weight(_text_:of in 900) [ClassicSimilarity], result of:
          0.008079554 = score(doc=900,freq=6.0), product of:
            0.053998582 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.034531306 = queryNorm
            0.1496253 = fieldWeight in 900, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=900)
        0.02063419 = weight(_text_:on in 900) [ClassicSimilarity], result of:
          0.02063419 = score(doc=900,freq=10.0), product of:
            0.07594867 = queryWeight, product of:
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.034531306 = queryNorm
            0.271686 = fieldWeight in 900, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              2.199415 = idf(docFreq=13325, maxDocs=44218)
              0.0390625 = fieldNorm(doc=900)
        0.060481843 = weight(_text_:great in 900) [ClassicSimilarity], result of:
          0.060481843 = score(doc=900,freq=2.0), product of:
            0.19443816 = queryWeight, product of:
              5.6307793 = idf(docFreq=430, maxDocs=44218)
              0.034531306 = queryNorm
            0.31105953 = fieldWeight in 900, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.6307793 = idf(docFreq=430, maxDocs=44218)
              0.0390625 = fieldNorm(doc=900)
      0.27272728 = coord(3/11)
    
    Abstract
    Purpose - A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches suffer from precision inefficiency at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors go beyond simple term indexing to propose a framework for multi-word term (MWT) filtering and indexing. Design/methodology/approach - In this paper, the authors rely on ranking MWTs to filter them, keeping the most effective ones for the indexing process. The proposed model is based on filtering MWTs according to their ability to capture the document topic and to distinguish between different documents from the same collection. The authors rely on the hypothesis that the best MWTs are those that achieve the greatest association degree. The experiments are carried out with English and French language data sets. Findings - The results indicate that this approach achieved precision enhancements at low recall, and it performed better than more advanced models based on term dependencies. Originality/value - The originality lies in using and testing different association measures to select the MWTs that best describe the documents, so as to enhance precision in the first retrieved documents.
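    The filtering idea described above, keeping only the multi-word term candidates with the highest association degree for indexing, can be sketched as follows. Pointwise mutual information is used here as one possible association measure; the measure choice, the toy text and the cut-off parameters are illustrative assumptions.

      # Minimal sketch of multi-word term (MWT) filtering by association
      # degree: two-word candidates are ranked by pointwise mutual
      # information and only the top-ranked ones are kept for indexing.

      import math
      from collections import Counter

      def top_mwt(tokens, k=3, min_count=2):
          unigrams = Counter(tokens)
          bigrams = Counter(zip(tokens, tokens[1:]))
          total_uni = sum(unigrams.values())
          total_bi = sum(bigrams.values())
          scored = []
          for (w1, w2), c in bigrams.items():
              if c < min_count:
                  continue
              p_xy = c / total_bi
              p_x = unigrams[w1] / total_uni
              p_y = unigrams[w2] / total_uni
              scored.append((math.log2(p_xy / (p_x * p_y)), w1 + " " + w2))
          return [term for _, term in sorted(scored, reverse=True)[:k]]

      if __name__ == "__main__":
          text = ("information retrieval systems rank documents ; "
                  "multi word terms improve information retrieval precision ; "
                  "multi word terms capture the document topic").split()
          print(top_mwt(text))   # e.g. ['multi word', 'word terms', 'information retrieval']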
