Search (110 results, page 1 of 6)

  • × theme_ss:"Computerlinguistik"
  • × year_i:[2010 TO 2020}
  1. Huo, W.: Automatic multi-word term extraction and its application to Web-page summarization (2012) 0.35
    0.34568116 = product of:
      0.4609082 = sum of:
        0.025048172 = weight(_text_:retrieval in 563) [ClassicSimilarity], result of:
          0.025048172 = score(doc=563,freq=2.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.20052543 = fieldWeight in 563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=563)
        0.19676036 = weight(_text_:2f in 563) [ClassicSimilarity], result of:
          0.19676036 = score(doc=563,freq=2.0), product of:
            0.35009617 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.041294612 = queryNorm
            0.56201804 = fieldWeight in 563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=563)
        0.018933605 = weight(_text_:of in 563) [ClassicSimilarity], result of:
          0.018933605 = score(doc=563,freq=16.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.2932045 = fieldWeight in 563, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=563)
        0.19676036 = weight(_text_:2f in 563) [ClassicSimilarity], result of:
          0.19676036 = score(doc=563,freq=2.0), product of:
            0.35009617 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.041294612 = queryNorm
            0.56201804 = fieldWeight in 563, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=563)
        0.006621159 = product of:
          0.013242318 = sum of:
            0.013242318 = weight(_text_:on in 563) [ClassicSimilarity], result of:
              0.013242318 = score(doc=563,freq=2.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.14580199 = fieldWeight in 563, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.046875 = fieldNorm(doc=563)
          0.5 = coord(1/2)
        0.016784549 = product of:
          0.033569098 = sum of:
            0.033569098 = weight(_text_:22 in 563) [ClassicSimilarity], result of:
              0.033569098 = score(doc=563,freq=2.0), product of:
                0.1446067 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.041294612 = queryNorm
                0.23214069 = fieldWeight in 563, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=563)
          0.5 = coord(1/2)
      0.75 = coord(6/8)
    
    Abstract
    In this thesis we propose three new word association measures for multi-word term extraction. We combine these association measures with LocalMaxs algorithm in our extraction model and compare the results of different multi-word term extraction methods. Our approach is language and domain independent and requires no training data. It can be applied to such tasks as text summarization, information retrieval, and document classification. We further explore the potential of using multi-word terms as an effective representation for general web-page summarization. We extract multi-word terms from human written summaries in a large collection of web-pages, and generate the summaries by aligning document words with these multi-word terms. Our system applies machine translation technology to learn the aligning process from a training set and focuses on selecting high quality multi-word terms from human written summaries to generate suitable results for web-page summarization.
    Content
    A Thesis presented to The University of Guelph In partial fulfilment of requirements for the degree of Master of Science in Computer Science. Vgl. Unter: http://www.inf.ufrgs.br%2F~ceramisch%2Fdownload_files%2Fpublications%2F2009%2Fp01.pdf.
    Date
    10. 1.2013 19:22:47
    Imprint
    Guelph, Ontario : University of Guelph
  2. Colace, F.; Santo, M. De; Greco, L.; Napoletano, P.: Weighted word pairs for query expansion (2015) 0.05
    0.050663535 = product of:
      0.10132707 = sum of:
        0.041327372 = weight(_text_:retrieval in 2687) [ClassicSimilarity], result of:
          0.041327372 = score(doc=2687,freq=4.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.33085006 = fieldWeight in 2687, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2687)
        0.029945528 = weight(_text_:use in 2687) [ClassicSimilarity], result of:
          0.029945528 = score(doc=2687,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.23682132 = fieldWeight in 2687, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2687)
        0.019129815 = weight(_text_:of in 2687) [ClassicSimilarity], result of:
          0.019129815 = score(doc=2687,freq=12.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.29624295 = fieldWeight in 2687, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2687)
        0.010924355 = product of:
          0.02184871 = sum of:
            0.02184871 = weight(_text_:on in 2687) [ClassicSimilarity], result of:
              0.02184871 = score(doc=2687,freq=4.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.24056101 = fieldWeight in 2687, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2687)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    This paper proposes a novel query expansion method to improve accuracy of text retrieval systems. Our method makes use of a minimal relevance feedback to expand the initial query with a structured representation composed of weighted pairs of words. Such a structure is obtained from the relevance feedback through a method for pairs of words selection based on the Probabilistic Topic Model. We compared our method with other baseline query expansion schemes and methods. Evaluations performed on TREC-8 demonstrated the effectiveness of the proposed method with respect to the baseline.
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
  3. Bedathur, S.; Narang, A.: Mind your language : effects of spoken query formulation on retrieval effectiveness (2013) 0.05
    0.04823032 = product of:
      0.09646064 = sum of:
        0.041327372 = weight(_text_:retrieval in 1150) [ClassicSimilarity], result of:
          0.041327372 = score(doc=1150,freq=4.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.33085006 = fieldWeight in 1150, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1150)
        0.029945528 = weight(_text_:use in 1150) [ClassicSimilarity], result of:
          0.029945528 = score(doc=1150,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.23682132 = fieldWeight in 1150, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1150)
        0.017463053 = weight(_text_:of in 1150) [ClassicSimilarity], result of:
          0.017463053 = score(doc=1150,freq=10.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.2704316 = fieldWeight in 1150, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1150)
        0.007724685 = product of:
          0.01544937 = sum of:
            0.01544937 = weight(_text_:on in 1150) [ClassicSimilarity], result of:
              0.01544937 = score(doc=1150,freq=2.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.17010231 = fieldWeight in 1150, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1150)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    Voice search is becoming a popular mode for interacting with search engines. As a result, research has gone into building better voice transcription engines, interfaces, and search engines that better handle inherent verbosity of queries. However, when one considers its use by non- native speakers of English, another aspect that becomes important is the formulation of the query by users. In this paper, we present the results of a preliminary study that we conducted with non-native English speakers who formulate queries for given retrieval tasks. Our results show that the current search engines are sensitive in their rankings to the query formulation, and thus highlights the need for developing more robust ranking methods.
  4. Belbachir, F.; Boughanem, M.: Using language models to improve opinion detection (2018) 0.05
    0.04610965 = product of:
      0.0922193 = sum of:
        0.040903494 = weight(_text_:retrieval in 5044) [ClassicSimilarity], result of:
          0.040903494 = score(doc=5044,freq=12.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.32745665 = fieldWeight in 5044, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.03125 = fieldNorm(doc=5044)
        0.02963839 = weight(_text_:use in 5044) [ClassicSimilarity], result of:
          0.02963839 = score(doc=5044,freq=6.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.23439234 = fieldWeight in 5044, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.03125 = fieldNorm(doc=5044)
        0.011807178 = weight(_text_:of in 5044) [ClassicSimilarity], result of:
          0.011807178 = score(doc=5044,freq=14.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.18284513 = fieldWeight in 5044, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.03125 = fieldNorm(doc=5044)
        0.009870241 = product of:
          0.019740483 = sum of:
            0.019740483 = weight(_text_:on in 5044) [ClassicSimilarity], result of:
              0.019740483 = score(doc=5044,freq=10.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.21734878 = fieldWeight in 5044, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.03125 = fieldNorm(doc=5044)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    Opinion mining is one of the most important research tasks in the information retrieval research community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second state which itself focusses on detecting opinionated documents . We compare the document to be analyzed with opinionated sources that contain subjective information. We hypothesize that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analysis document and reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries. However, in our study, we modify these language models to represent opinionated documents. We carry out several experiments using Text REtrieval Conference (TREC) Blogs 06 as our analysis collection and Internet Movie Data Bases (IMDB), Multi-Perspective Question Answering (MPQA) and CHESLY as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection alongside different combinations of opinion and retrieval scores. We then use this data to deduce the best opinion detection models. Using the best models, our approach improves on the best baseline of TREC Blog (baseline4) by 30%.
  5. Kim, S.; Ko, Y.; Oard, D.W.: Combining lexical and statistical translation evidence for cross-language information retrieval (2015) 0.04
    0.041024607 = product of:
      0.08204921 = sum of:
        0.035423465 = weight(_text_:retrieval in 1606) [ClassicSimilarity], result of:
          0.035423465 = score(doc=1606,freq=4.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.2835858 = fieldWeight in 1606, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
        0.025667597 = weight(_text_:use in 1606) [ClassicSimilarity], result of:
          0.025667597 = score(doc=1606,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.20298971 = fieldWeight in 1606, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
        0.011594418 = weight(_text_:of in 1606) [ClassicSimilarity], result of:
          0.011594418 = score(doc=1606,freq=6.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.17955035 = fieldWeight in 1606, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=1606)
        0.009363732 = product of:
          0.018727465 = sum of:
            0.018727465 = weight(_text_:on in 1606) [ClassicSimilarity], result of:
              0.018727465 = score(doc=1606,freq=4.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.20619515 = fieldWeight in 1606, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1606)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    This article explores how best to use lexical and statistical translation evidence together for cross-language information retrieval (CLIR). Lexical translation evidence is assembled from Wikipedia and from a large machine-readable dictionary, statistical translation evidence is drawn from parallel corpora, and evidence from co-occurrence in the document language provides a basis for limiting the adverse effect of translation ambiguity. Coverage statistics for NII Testbeds and Community for Information Access Research (NTCIR) queries confirm that these resources have complementary strengths. Experiments with translation evidence from a small parallel corpus indicate that even rather rough estimates of translation probabilities can yield further improvements over a strong technique for translation weighting based on using Jensen-Shannon divergence as a term-association measure. Finally, a novel approach to posttranslation query expansion using a random walk over the Wikipedia concept link graph is shown to yield further improvements over alternative techniques for posttranslation query expansion. Evaluation results on the NTCIR-5 English-Korean test collection show statistically significant improvements over strong baselines.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.1, S.23-39
  6. Lu, K.; Cai, X.; Ajiferuke, I.; Wolfram, D.: Vocabulary size and its effect on topic representation (2017) 0.04
    0.04062396 = product of:
      0.08124792 = sum of:
        0.025048172 = weight(_text_:retrieval in 3414) [ClassicSimilarity], result of:
          0.025048172 = score(doc=3414,freq=2.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.20052543 = fieldWeight in 3414, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=3414)
        0.025667597 = weight(_text_:use in 3414) [ClassicSimilarity], result of:
          0.025667597 = score(doc=3414,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.20298971 = fieldWeight in 3414, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.046875 = fieldNorm(doc=3414)
        0.021168415 = weight(_text_:of in 3414) [ClassicSimilarity], result of:
          0.021168415 = score(doc=3414,freq=20.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.32781258 = fieldWeight in 3414, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=3414)
        0.009363732 = product of:
          0.018727465 = sum of:
            0.018727465 = weight(_text_:on in 3414) [ClassicSimilarity], result of:
              0.018727465 = score(doc=3414,freq=4.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.20619515 = fieldWeight in 3414, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3414)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
  7. Fernández, R.T.; Losada, D.E.: Effective sentence retrieval based on query-independent evidence (2012) 0.04
    0.03569212 = product of:
      0.09517899 = sum of:
        0.07084693 = weight(_text_:retrieval in 2728) [ClassicSimilarity], result of:
          0.07084693 = score(doc=2728,freq=16.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.5671716 = fieldWeight in 2728, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=2728)
        0.014968331 = weight(_text_:of in 2728) [ClassicSimilarity], result of:
          0.014968331 = score(doc=2728,freq=10.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.23179851 = fieldWeight in 2728, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=2728)
        0.009363732 = product of:
          0.018727465 = sum of:
            0.018727465 = weight(_text_:on in 2728) [ClassicSimilarity], result of:
              0.018727465 = score(doc=2728,freq=4.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.20619515 = fieldWeight in 2728, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2728)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    In this paper we propose an effective sentence retrieval method that consists of incorporating query-independent features into standard sentence retrieval models. To meet this aim, we apply a formal methodology and consider different query-independent features. In particular, we show that opinion-based features are promising. Opinion mining is an increasingly important research topic but little is known about how to improve retrieval algorithms with opinion-based components. In this respect, we consider here different kinds of opinion-based features to act as query-independent evidence and study whether this incorporation improves retrieval performance. On the other hand, information needs are usually related to people, locations or organizations. We hypothesize here that using these named entities as query-independent features may also improve the sentence relevance estimation. Finally, the length of the retrieval unit has been shown to be an important component in different retrieval scenarios. We therefore include length-based features in our study. Our evaluation demonstrates that, either in isolation or in combination, these query-independent features help to improve substantially the performance of state-of-the-art sentence retrieval methods.
  8. Rajasurya, S.; Muralidharan, T.; Devi, S.; Swamynathan, S.: Semantic information retrieval using ontology in university domain (2012) 0.03
    0.03445023 = product of:
      0.06890046 = sum of:
        0.029519552 = weight(_text_:retrieval in 2861) [ClassicSimilarity], result of:
          0.029519552 = score(doc=2861,freq=4.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.23632148 = fieldWeight in 2861, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2861)
        0.021389665 = weight(_text_:use in 2861) [ClassicSimilarity], result of:
          0.021389665 = score(doc=2861,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.1691581 = fieldWeight in 2861, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2861)
        0.012473608 = weight(_text_:of in 2861) [ClassicSimilarity], result of:
          0.012473608 = score(doc=2861,freq=10.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.19316542 = fieldWeight in 2861, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2861)
        0.0055176322 = product of:
          0.0110352645 = sum of:
            0.0110352645 = weight(_text_:on in 2861) [ClassicSimilarity], result of:
              0.0110352645 = score(doc=2861,freq=2.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.121501654 = fieldWeight in 2861, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2861)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    Today's conventional search engines hardly do provide the essential content relevant to the user's search query. This is because the context and semantics of the request made by the user is not analyzed to the full extent. So here the need for a semantic web search arises. SWS is upcoming in the area of web search which combines Natural Language Processing and Artificial Intelligence. The objective of the work done here is to design, develop and implement a semantic search engine- SIEU(Semantic Information Extraction in University Domain) confined to the university domain. SIEU uses ontology as a knowledge base for the information retrieval process. It is not just a mere keyword search. It is one layer above what Google or any other search engines retrieve by analyzing just the keywords. Here the query is analyzed both syntactically and semantically. The developed system retrieves the web results more relevant to the user query through keyword expansion. The results obtained here will be accurate enough to satisfy the request made by the user. The level of accuracy will be enhanced since the query is analyzed semantically. The system will be of great use to the developers and researchers who work on web. The Google results are re-ranked and optimized for providing the relevant links. For ranking an algorithm has been applied which fetches more apt results for the user query.
  9. Vlachidis, A.; Binding, C.; Tudhope, D.; May, K.: Excavating grey literature : a case study on the rich indexing of archaeological documents via natural language-processing techniques and knowledge-based resources (2010) 0.03
    0.0340364 = product of:
      0.0680728 = sum of:
        0.016698781 = weight(_text_:retrieval in 3948) [ClassicSimilarity], result of:
          0.016698781 = score(doc=3948,freq=2.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.13368362 = fieldWeight in 3948, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.03125 = fieldNorm(doc=3948)
        0.024199642 = weight(_text_:use in 3948) [ClassicSimilarity], result of:
          0.024199642 = score(doc=3948,freq=4.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.19138055 = fieldWeight in 3948, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.03125 = fieldNorm(doc=3948)
        0.02093189 = weight(_text_:of in 3948) [ClassicSimilarity], result of:
          0.02093189 = score(doc=3948,freq=44.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.3241498 = fieldWeight in 3948, product of:
              6.6332498 = tf(freq=44.0), with freq of:
                44.0 = termFreq=44.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.03125 = fieldNorm(doc=3948)
        0.0062424885 = product of:
          0.012484977 = sum of:
            0.012484977 = weight(_text_:on in 3948) [ClassicSimilarity], result of:
              0.012484977 = score(doc=3948,freq=4.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.13746344 = fieldWeight in 3948, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3948)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    Purpose - This paper sets out to discuss the use of information extraction (IE), a natural language-processing (NLP) technique to assist "rich" semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic-aware "rich" indexing of diverse natural language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project. Design/methodology/approach - The paper proposes use of the English Heritage extension (CRM-EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology-Oriented Information Extraction process. The process of semantic indexing is based on a rule-based Information Extraction technique, which is facilitated by the General Architecture of Text Engineering (GATE) toolkit and expressed by Java Annotation Pattern Engine (JAPE) rules. Findings - Initial results suggest that the combination of information extraction with knowledge resources and standard conceptual models is capable of supporting semantic-aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms. Originality/value - The value of the paper lies in the semantic indexing of 535 unpublished online documents often referred to as "Grey Literature", from the Archaeological Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and P19.Physical Object.
    Footnote
    Beitrag in einem Special Issue: Content architecture: exploiting and managing diverse resources: proceedings of the first national conference of the United Kingdom chapter of the International Society for Knowedge Organization (ISKO)
  10. Fóris, A.: Network theory and terminology (2013) 0.03
    0.032280713 = product of:
      0.06456143 = sum of:
        0.021389665 = weight(_text_:use in 1365) [ClassicSimilarity], result of:
          0.021389665 = score(doc=1365,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.1691581 = fieldWeight in 1365, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1365)
        0.023667008 = weight(_text_:of in 1365) [ClassicSimilarity], result of:
          0.023667008 = score(doc=1365,freq=36.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.36650562 = fieldWeight in 1365, product of:
              6.0 = tf(freq=36.0), with freq of:
                36.0 = termFreq=36.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1365)
        0.0055176322 = product of:
          0.0110352645 = sum of:
            0.0110352645 = weight(_text_:on in 1365) [ClassicSimilarity], result of:
              0.0110352645 = score(doc=1365,freq=2.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.121501654 = fieldWeight in 1365, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1365)
          0.5 = coord(1/2)
        0.013987125 = product of:
          0.02797425 = sum of:
            0.02797425 = weight(_text_:22 in 1365) [ClassicSimilarity], result of:
              0.02797425 = score(doc=1365,freq=2.0), product of:
                0.1446067 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.041294612 = queryNorm
                0.19345059 = fieldWeight in 1365, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1365)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    The paper aims to present the relations of network theory and terminology. The model of scale-free networks, which has been recently developed and widely applied since, can be effectively used in terminology research as well. Operation based on the principle of networks is a universal characteristic of complex systems. Networks are governed by general laws. The model of scale-free networks can be viewed as a statistical-probability model, and it can be described with mathematical tools. Its main feature is that "everything is connected to everything else," that is, every node is reachable (in a few steps) starting from any other node; this phenomena is called "the small world phenomenon." The existence of a linguistic network and the general laws of the operation of networks enable us to place issues of language use in the complex system of relations that reveal the deeper connection s between phenomena with the help of networks embedded in each other. The realization of the metaphor that language also has a network structure is the basis of the classification methods of the terminological system, and likewise of the ways of creating terminology databases, which serve the purpose of providing easy and versatile accessibility to specialised knowledge.
    Content
    Beitrag im Rahmen eines Special Issue: 'Paradigms of Knowledge and its Organization: The Tree, the Net and Beyond,' edited by Fulvio Mazzocchi and Gian Carlo Fedeli. - Vgl.: http://www.ergon-verlag.de/isko_ko/downloads/ko_40_2013_6_i.pdf.
    Date
    2. 9.2014 21:22:48
  11. Luo, Z.; Yu, Y.; Osborne, M.; Wang, T.: Structuring tweets for improving Twitter search (2015) 0.03
    0.031178337 = product of:
      0.08314223 = sum of:
        0.055226028 = weight(_text_:retrieval in 2335) [ClassicSimilarity], result of:
          0.055226028 = score(doc=2335,freq=14.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.442117 = fieldWeight in 2335, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2335)
        0.02011309 = weight(_text_:of in 2335) [ClassicSimilarity], result of:
          0.02011309 = score(doc=2335,freq=26.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.31146988 = fieldWeight in 2335, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2335)
        0.007803111 = product of:
          0.015606222 = sum of:
            0.015606222 = weight(_text_:on in 2335) [ClassicSimilarity], result of:
              0.015606222 = score(doc=2335,freq=4.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.1718293 = fieldWeight in 2335, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2335)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    Spam and wildly varying documents make searching in Twitter challenging. Most Twitter search systems generally treat a Tweet as a plain text when modeling relevance. However, a series of conventions allows users to Tweet in structural ways using a combination of different blocks of texts. These blocks include plain texts, hashtags, links, mentions, etc. Each block encodes a variety of communicative intent and the sequence of these blocks captures changing discourse. Previous work shows that exploiting the structural information can improve the structured documents (e.g., web pages) retrieval. In this study we utilize the structure of Tweets, induced by these blocks, for Twitter retrieval and Twitter opinion retrieval. For Twitter retrieval, a set of features, derived from the blocks of text and their combinations, is used into a learning-to-rank scenario. We show that structuring Tweets can achieve state-of-the-art performance. Our approach does not rely on social media features, but when we do add this additional information, performance improves significantly. For Twitter opinion retrieval, we explore the question of whether structural information derived from the body of Tweets and opinionatedness ratings of Tweets can improve performance. Experimental results show that retrieval using a novel unsupervised opinionatedness feature based on structuring Tweets achieves comparable performance with a supervised method using manually tagged Tweets. Topic-related specific structured Tweet sets are shown to help with query-dependent opinion retrieval.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2522-2539
  12. Spitkovsky, V.I.; Chang, A.X.: ¬A cross-lingual dictionary for english Wikipedia concepts (2012) 0.03
    0.030787328 = product of:
      0.08209954 = sum of:
        0.025667597 = weight(_text_:use in 336) [ClassicSimilarity], result of:
          0.025667597 = score(doc=336,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.20298971 = fieldWeight in 336, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.046875 = fieldNorm(doc=336)
        0.013388081 = weight(_text_:of in 336) [ClassicSimilarity], result of:
          0.013388081 = score(doc=336,freq=8.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.20732689 = fieldWeight in 336, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=336)
        0.043043863 = product of:
          0.086087726 = sum of:
            0.086087726 = weight(_text_:line in 336) [ClassicSimilarity], result of:
              0.086087726 = score(doc=336,freq=2.0), product of:
                0.23157367 = queryWeight, product of:
                  5.6078424 = idf(docFreq=440, maxDocs=44218)
                  0.041294612 = queryNorm
                0.37175092 = fieldWeight in 336, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.6078424 = idf(docFreq=440, maxDocs=44218)
                  0.046875 = fieldNorm(doc=336)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    We present a resource for automatically associating strings of text with English Wikipedia concepts. Our machinery is bi-directional, in the sense that it uses the same fundamental probabilistic methods to map strings to empirical distributions over Wikipedia articles as it does to map article URLs to distributions over short, language-independent strings of natural language text. For maximal interoperability, we release our resource as a set of ?at line-based text ?les, lexicographically sorted and encoded with UTF-8. These files capture joint probability distributions underlying concepts (we use the terms article, concept and Wikipedia URL interchangeably) and associated snippets of text, as well as other features that can come in handy when working with Wikipedia articles and related information.
  13. Symonds, M.; Bruza, P.; Zuccon, G.; Koopman, B.; Sitbon, L.; Turner, I.: Automatic query expansion : a structural linguistic perspective (2014) 0.03
    0.029587397 = product of:
      0.078899726 = sum of:
        0.051129367 = weight(_text_:retrieval in 1338) [ClassicSimilarity], result of:
          0.051129367 = score(doc=1338,freq=12.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.40932083 = fieldWeight in 1338, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1338)
        0.0167351 = weight(_text_:of in 1338) [ClassicSimilarity], result of:
          0.0167351 = score(doc=1338,freq=18.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.25915858 = fieldWeight in 1338, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1338)
        0.0110352645 = product of:
          0.022070529 = sum of:
            0.022070529 = weight(_text_:on in 1338) [ClassicSimilarity], result of:
              0.022070529 = score(doc=1338,freq=8.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.24300331 = fieldWeight in 1338, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1338)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    A user's query is considered to be an imprecise description of their information need. Automatic query expansion is the process of reformulating the original query with the goal of improving retrieval effectiveness. Many successful query expansion techniques model syntagmatic associations that infer two terms co-occur more often than by chance in natural language. However, structural linguistics relies on both syntagmatic and paradigmatic associations to deduce the meaning of a word. Given the success of dependency-based approaches to query expansion and the reliance on word meanings in the query formulation process, we argue that modeling both syntagmatic and paradigmatic information in the query expansion process improves retrieval effectiveness. This article develops and evaluates a new query expansion technique that is based on a formal, corpus-based model of word meaning that models syntagmatic and paradigmatic associations. We demonstrate that when sufficient statistical information exists, as in the case of longer queries, including paradigmatic information alone provides significant improvements in retrieval effectiveness across a wide variety of data sets. More generally, when our new query expansion approach is applied to large-scale web retrieval it demonstrates significant improvements in retrieval effectiveness over a strong baseline system, based on a commercial search engine.
    Source
    Journal of the Association for Information Science and Technology. 65(2014) no.8, S.1577-1596
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
  14. Nagy T., I.: Detecting multiword expressions and named entities in natural language texts (2014) 0.03
    0.02729469 = product of:
      0.05458938 = sum of:
        0.014611433 = weight(_text_:retrieval in 1536) [ClassicSimilarity], result of:
          0.014611433 = score(doc=1536,freq=2.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.11697317 = fieldWeight in 1536, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.02734375 = fieldNorm(doc=1536)
        0.014972764 = weight(_text_:use in 1536) [ClassicSimilarity], result of:
          0.014972764 = score(doc=1536,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.11841066 = fieldWeight in 1536, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.02734375 = fieldNorm(doc=1536)
        0.018315405 = weight(_text_:of in 1536) [ClassicSimilarity], result of:
          0.018315405 = score(doc=1536,freq=44.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.28363106 = fieldWeight in 1536, product of:
              6.6332498 = tf(freq=44.0), with freq of:
                44.0 = termFreq=44.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.02734375 = fieldNorm(doc=1536)
        0.0066897743 = product of:
          0.013379549 = sum of:
            0.013379549 = weight(_text_:on in 1536) [ClassicSimilarity], result of:
              0.013379549 = score(doc=1536,freq=6.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.14731294 = fieldWeight in 1536, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=1536)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    Multiword expressions (MWEs) are lexical items that can be decomposed into single words and display lexical, syntactic, semantic, pragmatic and/or statistical idiosyncrasy (Sag et al., 2002; Kim, 2008; Calzolari et al., 2002). The proper treatment of multiword expressions such as rock 'n' roll and make a decision is essential for many natural language processing (NLP) applications like information extraction and retrieval, terminology extraction and machine translation, and it is important to identify multiword expressions in context. For example, in machine translation we must know that MWEs form one semantic unit, hence their parts should not be translated separately. For this, multiword expressions should be identified first in the text to be translated. The chief aim of this thesis is to develop machine learning-based approaches for the automatic detection of different types of multiword expressions in English and Hungarian natural language texts. In our investigations, we pay attention to the characteristics of different types of multiword expressions such as nominal compounds, multiword named entities and light verb constructions, and we apply novel methods to identify MWEs in raw texts. In the thesis it will be demonstrated that nominal compounds and multiword amed entities may require a similar approach for their automatic detection as they behave in the same way from a linguistic point of view. Furthermore, it will be shown that the automatic detection of light verb constructions can be carried out using two effective machine learning-based approaches.
    In this thesis, we focused on the automatic detection of multiword expressions in natural language texts. On the basis of the main contributions, we can argue that: - Supervised machine learning methods can be successfully applied for the automatic detection of different types of multiword expressions in natural language texts. - Machine learning-based multiword expression detection can be successfully carried out for English as well as for Hungarian. - Our supervised machine learning-based model was successfully applied to the automatic detection of nominal compounds from English raw texts. - We developed a Wikipedia-based dictionary labeling method to automatically detect English nominal compounds. - A prior knowledge of nominal compounds can enhance Named Entity Recognition, while previously identified named entities can assist the nominal compound identification process. - The machine learning-based method can also provide acceptable results when it was trained on an automatically generated silver standard corpus. - As named entities form one semantic unit and may consist of more than one word and function as a noun, we can treat them in a similar way to nominal compounds. - Our sequence labelling-based tool can be successfully applied for identifying verbal light verb constructions in two typologically different languages, namely English and Hungarian. - Domain adaptation techniques may help diminish the distance between domains in the automatic detection of light verb constructions. - Our syntax-based method can be successfully applied for the full-coverage identification of light verb constructions. As a first step, a data-driven candidate extraction method can be utilized. After, a machine learning approach that makes use of an extended and rich feature set selects LVCs among extracted candidates. - When a precise syntactic parser is available for the actual domain, the full-coverage identification can be performed better. In other cases, the usage of the sequence labeling method is recommended.
    Imprint
    Szeged : University of Szeged, Faculty of Science and Informatics, Doctoral School of Computer Science
  15. Järvelin, A.; Keskustalo, H.; Sormunen, E.; Saastamoinen, M.; Kettunen, K.: Information retrieval from historical newspaper collections in highly inflectional languages : a query expansion approach (2016) 0.02
    0.024970733 = product of:
      0.06658862 = sum of:
        0.04174695 = weight(_text_:retrieval in 3223) [ClassicSimilarity], result of:
          0.04174695 = score(doc=3223,freq=8.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.33420905 = fieldWeight in 3223, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3223)
        0.019324033 = weight(_text_:of in 3223) [ClassicSimilarity], result of:
          0.019324033 = score(doc=3223,freq=24.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.2992506 = fieldWeight in 3223, product of:
              4.8989797 = tf(freq=24.0), with freq of:
                24.0 = termFreq=24.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3223)
        0.0055176322 = product of:
          0.0110352645 = sum of:
            0.0110352645 = weight(_text_:on in 3223) [ClassicSimilarity], result of:
              0.0110352645 = score(doc=3223,freq=2.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.121501654 = fieldWeight in 3223, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3223)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of historical newspaper collection the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
    Source
    Journal of the Association for Information Science and Technology. 67(2016) no.12, S.2928-2946
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
  16. Li, N.; Sun, J.: Improving Chinese term association from the linguistic perspective (2017) 0.02
    0.024631536 = product of:
      0.065684095 = sum of:
        0.025048172 = weight(_text_:retrieval in 3381) [ClassicSimilarity], result of:
          0.025048172 = score(doc=3381,freq=2.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.20052543 = fieldWeight in 3381, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.046875 = fieldNorm(doc=3381)
        0.025667597 = weight(_text_:use in 3381) [ClassicSimilarity], result of:
          0.025667597 = score(doc=3381,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.20298971 = fieldWeight in 3381, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.046875 = fieldNorm(doc=3381)
        0.014968331 = weight(_text_:of in 3381) [ClassicSimilarity], result of:
          0.014968331 = score(doc=3381,freq=10.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.23179851 = fieldWeight in 3381, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.046875 = fieldNorm(doc=3381)
      0.375 = coord(3/8)
    
    Abstract
    The study aims to solve how to construct the semantic relations of specific domain terms by applying linguistic rules. The semantic structure analysis at the morpheme level was used for semantic measure, and a morpheme-based term association model was proposed by improving and combining the literal-based similarity algorithm and co-occurrence relatedness methods. This study provides a novel insight into the method of semantic analysis and calculation by morpheme parsing, and the proposed solution is feasible for the automatic association of compound terms. The results show that this approach could be used to construct appropriate term association and form a reasonable structural knowledge graph. However, due to linguistic differences, the viability and effectiveness of the use of our method in non-Chinese linguistic environments should be verified.
    Theme
    Semantisches Umfeld in Indexierung u. Retrieval
  17. Ye, Z.; He, B.; Wang, L.; Luo, T.: Utilizing term proximity for blog post retrieval (2013) 0.02
    0.023999881 = product of:
      0.06399968 = sum of:
        0.04174695 = weight(_text_:retrieval in 1126) [ClassicSimilarity], result of:
          0.04174695 = score(doc=1126,freq=8.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.33420905 = fieldWeight in 1126, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1126)
        0.0167351 = weight(_text_:of in 1126) [ClassicSimilarity], result of:
          0.0167351 = score(doc=1126,freq=18.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.25915858 = fieldWeight in 1126, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1126)
        0.0055176322 = product of:
          0.0110352645 = sum of:
            0.0110352645 = weight(_text_:on in 1126) [ClassicSimilarity], result of:
              0.0110352645 = score(doc=1126,freq=2.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.121501654 = fieldWeight in 1126, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1126)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    Term proximity is effective for many information retrieval (IR) research fields yet remains unexplored in blogosphere IR. The blogosphere is characterized by large amounts of noise, including incohesive, off-topic content and spam. Consequently, the classical bag-of-words unigram IR models are not reliable enough to provide robust and effective retrieval performance. In this article, we propose to boost the blog postretrieval performance by employing term proximity information. We investigate a variety of popular and state-of-the-art proximity-based statistical IR models, including a proximity-based counting model, the Markov random field (MRF) model, and the divergence from randomness (DFR) multinomial model. Extensive experimentation on the standard TREC Blog06 test dataset demonstrates that the introduction of term proximity information is indeed beneficial to retrieval from the blogosphere. Results also indicate the superiority of the unordered bi-gram model with the sequential-dependence phrases over other variants of the proximity-based models. Finally, inspired by the effectiveness of proximity models, we extend our study by exploring the proximity evidence between query terms and opinionated terms. The consequent opinionated proximity model shows promising performance in the experiments.
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.11, S.2278-2298
  18. Deventer, J.P. van; Kruger, C.J.; Johnson, R.D.: Delineating knowledge management through lexical analysis : a retrospective (2015) 0.02
    0.023536477 = product of:
      0.047072954 = sum of:
        0.014972764 = weight(_text_:use in 3807) [ClassicSimilarity], result of:
          0.014972764 = score(doc=3807,freq=2.0), product of:
            0.12644777 = queryWeight, product of:
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.041294612 = queryNorm
            0.11841066 = fieldWeight in 3807, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.0620887 = idf(docFreq=5623, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3807)
        0.015619429 = weight(_text_:of in 3807) [ClassicSimilarity], result of:
          0.015619429 = score(doc=3807,freq=32.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.24188137 = fieldWeight in 3807, product of:
              5.656854 = tf(freq=32.0), with freq of:
                32.0 = termFreq=32.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.02734375 = fieldNorm(doc=3807)
        0.0066897743 = product of:
          0.013379549 = sum of:
            0.013379549 = weight(_text_:on in 3807) [ClassicSimilarity], result of:
              0.013379549 = score(doc=3807,freq=6.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.14731294 = fieldWeight in 3807, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=3807)
          0.5 = coord(1/2)
        0.009790987 = product of:
          0.019581974 = sum of:
            0.019581974 = weight(_text_:22 in 3807) [ClassicSimilarity], result of:
              0.019581974 = score(doc=3807,freq=2.0), product of:
                0.1446067 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.041294612 = queryNorm
                0.1354154 = fieldWeight in 3807, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=3807)
          0.5 = coord(1/2)
      0.5 = coord(4/8)
    
    Abstract
    Purpose Academic authors tend to define terms that meet their own needs. Knowledge Management (KM) is a term that comes to mind and is examined in this study. Lexicographical research identified KM terms used by authors from 1996 to 2006 in academic outlets to define KM. Data were collected based on strict criteria which included that definitions should be unique instances. From 2006 onwards, these authors could not identify new unique instances of definitions with repetitive usage of such definition instances. Analysis revealed that KM is directly defined by People (Person and Organisation), Processes (Codify, Share, Leverage, and Process) and Contextualised Content (Information). The paper aims to discuss these issues. Design/methodology/approach The aim of this paper is to add to the body of knowledge in the KM discipline and supply KM practitioners and scholars with insight into what is commonly regarded to be KM so as to reignite the debate on what one could consider as KM. The lexicon used by KM scholars was evaluated though the application of lexicographical research methods as extended though Knowledge Discovery and Text Analysis methods. Findings By simplifying term relationships through the application of lexicographical research methods, as extended though Knowledge Discovery and Text Analysis methods, it was found that KM is directly defined by People (Person and Organisation), Processes (Codify, Share, Leverage, Process) and Contextualised Content (Information). One would therefore be able to indicate that KM, from an academic point of view, refers to people processing contextualised content.
    Research limitations/implications In total, 42 definitions were identified spanning a period of 11 years. This represented the first use of KM through the estimated apex of terms used. From 2006 onwards definitions were used in repetition, and all definitions that were considered to repeat were therefore subsequently excluded as not being unique instances. All definitions listed are by no means complete and exhaustive. The definitions are viewed outside the scope and context in which they were originally formulated and then used to review the key concepts in the definitions themselves. Social implications When the authors refer to the aforementioned discussion of KM content as well as the presentation of the method followed in this paper, the authors may have a few implications for future research in KM. First the research validates ideas presented by the OECD in 2005 pertaining to KM. It also validates that through the evolution of KM, the authors ended with a description of KM that may be seen as a standardised description. If the authors as academics and practitioners, for example, refer to KM as the same construct and/or idea, it has the potential to speculatively, distinguish between what KM may or may not be. Originality/value By simplifying the term used to define KM, by focusing on the most common definitions, the paper assist in refocusing KM by reconsidering the dimensions that is the most common in how it has been defined over time. This would hopefully assist in reigniting discussions about KM and how it may be used to the benefit of an organisation.
    Date
    20. 1.2015 18:30:22
    Source
    Aslib journal of information management. 67(2015) no.2, S.203-229
  19. Anizi, M.; Dichy, J.: Improving information retrieval in Arabic through a multi-agent approach and a rich lexical resource (2011) 0.02
    0.023238916 = product of:
      0.061970443 = sum of:
        0.029519552 = weight(_text_:retrieval in 4738) [ClassicSimilarity], result of:
          0.029519552 = score(doc=4738,freq=4.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.23632148 = fieldWeight in 4738, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4738)
        0.02011309 = weight(_text_:of in 4738) [ClassicSimilarity], result of:
          0.02011309 = score(doc=4738,freq=26.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.31146988 = fieldWeight in 4738, product of:
              5.0990195 = tf(freq=26.0), with freq of:
                26.0 = termFreq=26.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=4738)
        0.012337802 = product of:
          0.024675604 = sum of:
            0.024675604 = weight(_text_:on in 4738) [ClassicSimilarity], result of:
              0.024675604 = score(doc=4738,freq=10.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.271686 = fieldWeight in 4738, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=4738)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    This paper addresses the optimization of information retrieval in Arabic. The results derived from the expanding development of sites in Arabic are often spectacular. Nevertheless, several observations indicate that the responses remain disappointing, particularly upon comparing users' requests and quality of responses. One of the problems encountered by users is the loss of time when navigating between different URLs to find adequate responses. This, in many cases, is due to the absence of forms morphologically related to the research keyword. Such problems can be approached through a morphological analyzer drawing on the DIINAR.1 morpho-lexical resource. A second problem concerns the formulation of the query, which may prove ambiguous, as in everyday language. We then focus on contextual disambiguation based on a rich lexical resource that includes collocations and set expressions. The overall scheme of such a resource will only be hinted at here. Our approach leads to the elaboration of a multi-agent system, motivated by a need to solve problems encountered when using conventional methods of analysis, and to improve the results of queries thanks to a better collaboration between different levels of analysis. We suggest resorting to four agents: morphological, morpho-lexical, contextualization, and an interface agent. These agents 'negotiate' and 'cooperate' throughout the analysis process, starting from the submission of the initial query, and going on until an adequate query is obtained.
    Content
    Beitrag innerhalb einer Special Section: Knowledge Organization, Competitive Intelligence, and Information Systems - Papers from 4th International Conference on "Information Systems & Economic Intelligence," February 17-19th, 2011. Marrakech - Morocco.
  20. Hoenkamp, E.; Bruza, P.: How everyday language can and will boost effective information retrieval (2015) 0.02
    0.022265585 = product of:
      0.059374895 = sum of:
        0.036153924 = weight(_text_:retrieval in 2123) [ClassicSimilarity], result of:
          0.036153924 = score(doc=2123,freq=6.0), product of:
            0.124912694 = queryWeight, product of:
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.041294612 = queryNorm
            0.28943354 = fieldWeight in 2123, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.024915 = idf(docFreq=5836, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2123)
        0.013664153 = weight(_text_:of in 2123) [ClassicSimilarity], result of:
          0.013664153 = score(doc=2123,freq=12.0), product of:
            0.06457475 = queryWeight, product of:
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.041294612 = queryNorm
            0.21160212 = fieldWeight in 2123, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.5637573 = idf(docFreq=25162, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2123)
        0.00955682 = product of:
          0.01911364 = sum of:
            0.01911364 = weight(_text_:on in 2123) [ClassicSimilarity], result of:
              0.01911364 = score(doc=2123,freq=6.0), product of:
                0.090823986 = queryWeight, product of:
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.041294612 = queryNorm
                0.21044704 = fieldWeight in 2123, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  2.199415 = idf(docFreq=13325, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2123)
          0.5 = coord(1/2)
      0.375 = coord(3/8)
    
    Abstract
    Typing 2 or 3 keywords into a browser has become an easy and efficient way to find information. Yet, typing even short queries becomes tedious on ever shrinking (virtual) keyboards. Meanwhile, speech processing is maturing rapidly, facilitating everyday language input. Also, wearable technology can inform users proactively by listening in on their conversations or processing their social media interactions. Given these developments, everyday language may soon become the new input of choice. We present an information retrieval (IR) algorithm specifically designed to accept everyday language. It integrates two paradigms of information retrieval, previously studied in isolation; one directed mainly at the surface structure of language, the other primarily at the underlying meaning. The integration was achieved by a Markov machine that encodes meaning by its transition graph, and surface structure by the language it generates. A rigorous evaluation of the approach showed, first, that it can compete with the quality of existing language models, second, that it is more effective the more verbose the input, and third, as a consequence, that it is promising for an imminent transition from keyword input, where the onus is on the user to formulate concise queries, to a modality where users can express more freely, more informal, and more natural their need for information in everyday language.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.8, S.1546-1558

Languages

  • e 96
  • d 12
  • el 1
  • More… Less…

Types

  • a 88
  • el 21
  • x 8
  • m 6
  • s 3
  • More… Less…