Search (190 results, page 1 of 10)

  • theme_ss:"Computerlinguistik"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.37
    
    Content
    Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf.
    Date
    8. 1.2013 10:22:32
    Source
    Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK
  2. Huo, W.: Automatic multi-word term extraction and its application to Web-page summarization (2012) 0.22
    
    Abstract
    In this thesis we propose three new word association measures for multi-word term extraction. We combine these association measures with LocalMaxs algorithm in our extraction model and compare the results of different multi-word term extraction methods. Our approach is language and domain independent and requires no training data. It can be applied to such tasks as text summarization, information retrieval, and document classification. We further explore the potential of using multi-word terms as an effective representation for general web-page summarization. We extract multi-word terms from human written summaries in a large collection of web-pages, and generate the summaries by aligning document words with these multi-word terms. Our system applies machine translation technology to learn the aligning process from a training set and focuses on selecting high quality multi-word terms from human written summaries to generate suitable results for web-page summarization.
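    To illustrate the kind of association measure such an extraction model builds on, here is a minimal Python sketch; the symmetric conditional probability (SCP) "glue" score and the toy corpus are illustrative stand-ins, not the thesis's actual measures or data:
      # Minimal sketch: scoring candidate two-word terms with a simple
      # association measure (symmetric conditional probability, SCP).
      from collections import Counter

      corpus = [
          "information retrieval improves web page summarization",
          "multi word term extraction supports information retrieval",
          "web page summarization uses multi word terms",
      ]

      unigrams, bigrams = Counter(), Counter()
      for sentence in corpus:
          tokens = sentence.split()
          unigrams.update(tokens)
          bigrams.update(zip(tokens, tokens[1:]))

      total_uni = sum(unigrams.values())
      total_bi = sum(bigrams.values())

      def scp(w1, w2):
          # Symmetric conditional probability: p(w1,w2)^2 / (p(w1) * p(w2)).
          p_bi = bigrams[(w1, w2)] / total_bi
          p1, p2 = unigrams[w1] / total_uni, unigrams[w2] / total_uni
          return (p_bi ** 2) / (p1 * p2) if p1 and p2 else 0.0

      # Rank candidate bigrams; a LocalMaxs-style extractor would keep a
      # candidate only when its glue is a local maximum among neighbouring n-grams.
      for (w1, w2), _ in bigrams.most_common():
          print(f"{w1} {w2}\t{scp(w1, w2):.3f}")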
    Content
    A thesis presented to the University of Guelph in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. Cf.: http://www.inf.ufrgs.br/~ceramisch/download_files/publications/2009/p01.pdf.
    Date
    10. 1.2013 19:22:47
  3. Noever, D.; Ciolino, M.: The Turing deception (2022) 0.12
    
    Source
    https://arxiv.org/abs/2212.06721
  4. Wang, F.L.; Yang, C.C.: Mining Web data for Chinese segmentation (2007) 0.05
    
    Abstract
    Modern information retrieval systems use keywords within documents as indexing terms for search of relevant documents. As Chinese is an ideographic character-based language, the words in the texts are not delimited by white spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has become the largest corpus that is ideal for Chinese segmentation. Although most search engines have problems in segmenting texts into proper words, they maintain huge databases of documents and frequencies of character sequences in the documents. Their databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm by mining Web data with the help of search engines. On the other hand, the Romanized pinyin of the Chinese language indicates boundaries of words in the text. Our algorithm is the first to utilize the Romanized pinyin for segmentation. It is the first unified segmentation algorithm for the Chinese language from different geographical areas, and it is also domain independent because of the nature of the Web. Experiments have been conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm can effectively deal with the problems of segmentation ambiguity, new word (unknown word) detection, and stop words.
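    To make the frequency-mining idea concrete, the following sketch picks the split of a character string whose segments score best against a frequency table; the table and the scoring are invented stand-ins for the search-engine statistics and do not reproduce the authors' actual algorithm:
      # Minimal sketch of frequency-based segmentation over a toy frequency table.
      from functools import lru_cache
      import math

      freq = {  # hypothetical corpus frequencies of candidate "words"
          "信息": 900, "检索": 700, "信息检索": 400,
          "系统": 800, "检索系统": 150,
      }

      def word_score(segment):
          # Reward known, frequent segments; penalize unknown ones so the
          # search prefers attested character sequences.
          n = freq.get(segment, 0)
          return math.log(n + 1) if n else -5.0

      @lru_cache(maxsize=None)
      def segment(text):
          # Return (score, segments) for the best-scoring split of `text`.
          if not text:
              return 0.0, ()
          best = (float("-inf"), ())
          for i in range(1, len(text) + 1):
              head, tail = text[:i], text[i:]
              tail_score, tail_segs = segment(tail)
              best = max(best, (word_score(head) + tail_score, (head,) + tail_segs))
          return best

      print(segment("信息检索系统"))  # best-scoring segmentation under this toy model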
    Footnote
    Contribution to a special topic section "Mining Web resources for enhancing information retrieval"
    Theme
    Data Mining
  5. Heyer, G.; Quasthoff, U.; Wittig, T.: Text Mining : Wissensrohstoff Text. Konzepte, Algorithmen, Ergebnisse (2006) 0.04
    
    Abstract
    A large part of the world's knowledge exists in the form of digital texts on the Internet or in intranets. Today's search engines exploit this raw material of knowledge only rudimentarily: they can recognize semantic relationships only to a limited extent. Everyone is waiting for the Semantic Web, in which the creators of texts add the semantics themselves, but that will still take a long time. There is, however, a technology that already makes it possible to analyze and prepare semantic relationships in raw text: the research field of text mining uses statistical and pattern-based methods to extract, process, and exploit knowledge from texts, laying the groundwork for the search engines of the future. This is the first German textbook on this technology: Text Mining: Wissensrohstoff Text. Konzepte, Algorithmen, Ergebnisse. What comes to mind for the word "Stich"? Some think of tennis, others of the card game Skat. Such different contexts can be determined automatically by text mining and presented in the form of word networks. Which terms most frequently appear to the left and right of the word "Festplatte"? Which word forms and proper names have newly entered the German language since 2001? Text mining answers these and many other questions. The textbook invites the reader to dive into a new, fascinating scientific discipline, to discover previously unknown connections and perspectives, and to see how the raw material of text becomes knowledge. It addresses both students and practitioners with a focus on computer science, business informatics, and/or linguistics who want to learn about the foundations, methods, and applications of text mining and who are looking for ideas for implementing their own applications. It is based on work carried out in recent years in the Natural Language Processing group at the Institute of Computer Science of the University of Leipzig under the direction of Prof. Dr. Heyer. A wealth of practical examples of text mining concepts and algorithms gives the reader a comprehensive and detailed understanding of the foundations and applications of text mining. Topics covered: knowledge and text; foundations of meaning analysis; text databases; language statistics; clustering; pattern analysis; hybrid methods; example applications; appendices on statistics and linguistic foundations. 360 pages, 54 figures, 58 tables, and 95 glossary entries, with a free e-learning course "Schnelleinstieg: Sprachstatistik"; an online certificate course with mentor and tutor support will shortly be available to accompany the book.
    Theme
    Data Mining
  6. Yang, C.C.; Luk, J.: Automatic generation of English/Chinese thesaurus based on a parallel corpus in laws (2003) 0.04
    
    Abstract
    The information available in languages other than English on the World Wide Web is increasing significantly. According to a report from Computer Economics in 1999, 54% of Internet users are English speakers ("English Will Dominate Web for Only Three More Years," Computer Economics, July 9, 1999, http://www.computereconomics.com/new4/pr/pr990610.html). However, it is predicted that there will be only a 60% increase in Internet users among English speakers versus a 150% growth among non-English speakers over the next five years. By 2005, 57% of Internet users will be non-English speakers. A report by CNN.com in 2000 showed that the number of Internet users in China had increased from 8.9 million to 16.9 million from January to June in 2000 ("Report: China Internet users double to 17 million," CNN.com, July, 2000, http://cnn.org/2000/TECH/computing/07/27/china.internet.reut/index.html). According to Nielsen/NetRatings, there was a dramatic leap from 22.5 million to 56.6 million Internet users from 2001 to 2002. China had become the second largest global at-home Internet population in 2002 (the US Internet population was 166 million) (Robyn Greenspan, "China Pulls Ahead of Japan," Internet.com, April 22, 2002, http://cyberatlas.internet.com/big-picture/geographics/article/0,,5911_1013841,00.html). All of this evidence reveals the importance of cross-lingual research to satisfy the needs of the near future. Digital library research has in the past focused on structural and semantic interoperability. Searching and retrieving objects across variations in protocols, formats and disciplines have been widely explored (Schatz, B., & Chen, H. (1999). Digital libraries: technological advances and social impacts. IEEE Computer, Special Issue on Digital Libraries, February, 32(2), 45-50; Chen, H., Yen, J., & Yang, C.C. (1999). International activities: development of Asian digital libraries. IEEE Computer, Special Issue on Digital Libraries, 32(2), 48-49). However, research in crossing language boundaries, especially between European and Oriental languages, is still in its initial stage. In this proposal, we put our focus on cross-lingual semantic interoperability by developing automatic generation of a cross-lingual thesaurus based on an English/Chinese parallel corpus. When searchers encounter retrieval problems, professional librarians usually consult the thesaurus to identify other relevant vocabularies. In the problem of searching across language boundaries, a cross-lingual thesaurus, generated by co-occurrence analysis and a Hopfield network, can be used to generate additional semantically relevant terms that cannot be obtained from a dictionary. In particular, the automatically generated cross-lingual thesaurus is able to capture unknown words that do not exist in a dictionary, such as names of persons, organizations, and events. Due to Hong Kong's unique historical background, both English and Chinese are used as official languages in all legal documents. Therefore, English/Chinese cross-lingual information retrieval is critical for applications in the courts and the government. In this paper, we develop an automatic thesaurus using the Hopfield network, based on a parallel corpus collected from the Web site of the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) Government. Experiments are conducted to measure the precision and recall of the automatically generated English/Chinese thesaurus. The results show that such a thesaurus is a promising tool for retrieving relevant terms, especially in a language other than that of the input term. The direct translation of the input term can also be retrieved in most cases.
    Footnote
    Part of a special issue: "Web retrieval and mining: A machine learning perspective"
  7. Al-Khatib, K.; Ghosal, T.; Hou, Y.; Waard, A. de; Freitag, D.: Argument mining for scholarly document processing : taking stock and looking ahead (2021) 0.03
    
    Abstract
    Argument mining targets structures in natural language related to interpretation and persuasion. Most scholarly discourse involves interpreting experimental evidence and attempting to persuade other scientists to adopt the same conclusions, which could benefit from argument mining techniques. However, while various argument mining studies have addressed student essays and news articles, those that target scientific discourse are still scarce. This paper surveys existing work in argument mining of scholarly discourse, and provides an overview of current models, data, tasks, and applications. We identify a number of key challenges confronting argument mining in the scientific domain, and suggest some possible solutions and future directions.
  8. Perovsek, M.; Kranjc, J.; Erjavec, T.; Cestnik, B.; Lavrac, N.: TextFlows : a visual programming platform for text mining and natural language processing (2016) 0.02
    
    Abstract
    Text mining and natural language processing are fast growing areas of research, with numerous applications in business, science and creative industries. This paper presents TextFlows, a web-based text mining and natural language processing platform supporting workflow construction, sharing and execution. The platform enables visual construction of text mining workflows through a web browser, and the execution of the constructed workflows on a processing cloud. This makes TextFlows an adaptable infrastructure for the construction and sharing of text processing workflows, which can be reused in various applications. The paper presents the implemented text mining and language processing modules, and describes some precomposed workflows. Their features are demonstrated on three use cases: comparison of document classifiers and of different part-of-speech taggers on a text categorization problem, and outlier detection in document corpora.
  9. Symonds, M.; Bruza, P.; Zuccon, G.; Koopman, B.; Sitbon, L.; Turner, I.: Automatic query expansion : a structural linguistic perspective (2014) 0.02
    
    Abstract
    A user's query is considered to be an imprecise description of their information need. Automatic query expansion is the process of reformulating the original query with the goal of improving retrieval effectiveness. Many successful query expansion techniques model syntagmatic associations that infer two terms co-occur more often than by chance in natural language. However, structural linguistics relies on both syntagmatic and paradigmatic associations to deduce the meaning of a word. Given the success of dependency-based approaches to query expansion and the reliance on word meanings in the query formulation process, we argue that modeling both syntagmatic and paradigmatic information in the query expansion process improves retrieval effectiveness. This article develops and evaluates a new query expansion technique that is based on a formal, corpus-based model of word meaning that models syntagmatic and paradigmatic associations. We demonstrate that when sufficient statistical information exists, as in the case of longer queries, including paradigmatic information alone provides significant improvements in retrieval effectiveness across a wide variety of data sets. More generally, when our new query expansion approach is applied to large-scale web retrieval it demonstrates significant improvements in retrieval effectiveness over a strong baseline system, based on a commercial search engine.
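    As a toy illustration of the syntagmatic side of this idea, the sketch below ranks expansion candidates by how often they co-occur with the query terms; the corpus and the raw-count measure are stand-ins, not the article's formal model of word meaning (which also exploits paradigmatic associations):
      # Toy sketch of syntagmatic query expansion via co-occurrence counts.
      from collections import Counter
      from itertools import combinations

      docs = [
          "query expansion improves retrieval effectiveness",
          "retrieval effectiveness depends on the query terms",
          "paradigmatic and syntagmatic associations model word meaning",
      ]
      query = {"query", "retrieval"}

      cooc = Counter()
      for doc in docs:
          tokens = set(doc.split())
          for a, b in combinations(sorted(tokens), 2):
              cooc[(a, b)] += 1

      def association(term):
          # Count co-occurrences of `term` with any query term.
          return sum(c for (a, b), c in cooc.items()
                     if term in (a, b) and (query & {a, b}) - {term})

      candidates = {t for d in docs for t in d.split()} - query
      for term in sorted(candidates, key=association, reverse=True)[:5]:
          print(term, association(term))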
  10. Belbachir, F.; Boughanem, M.: Using language models to improve opinion detection (2018) 0.02
    
    Abstract
    Opinion mining is one of the most important research tasks in the information retrieval research community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, which itself focuses on detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information. We hypothesize that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analysis document and reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries. However, in our study, we modify these language models to represent opinionated documents. We carry out several experiments using Text REtrieval Conference (TREC) Blogs 06 as our analysis collection and Internet Movie Data Bases (IMDB), Multi-Perspective Question Answering (MPQA) and CHESLY as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection alongside different combinations of opinion and retrieval scores. We then use this data to deduce the best opinion detection models. Using the best models, our approach improves on the best baseline of TREC Blog (baseline4) by 30%.
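    A minimal sketch of the language-model ingredient follows: it scores a document against an "opinionated" reference collection with a Jelinek-Mercer smoothed unigram model; the reference text, smoothing weight and scoring details are illustrative assumptions rather than the paper's exact configuration:
      # Minimal sketch: Jelinek-Mercer smoothed unigram model used to compare
      # a document with a small "opinionated" reference collection.
      from collections import Counter
      import math

      reference = "i love this film it is wonderful and i hate boring plots".split()
      ref_counts, ref_len = Counter(reference), len(reference)

      def jm_probability(word, doc_counts, doc_len, lam=0.7):
          # P_JM(w|d) = lam * P_ml(w|d) + (1 - lam) * P(w|reference)
          p_doc = doc_counts[word] / doc_len if doc_len else 0.0
          p_ref = ref_counts[word] / ref_len
          return lam * p_doc + (1 - lam) * p_ref

      def opinion_score(document):
          # Log-likelihood of the reference (opinion) vocabulary under the
          # document's smoothed model; higher means closer to opinionated language.
          tokens = document.lower().split()
          counts, length = Counter(tokens), len(tokens)
          return sum(math.log(jm_probability(w, counts, length) + 1e-12)
                     for w in ref_counts)

      print(opinion_score("i love this film"))          # more opinion-like
      print(opinion_score("the camera records video"))  # less opinion-like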
  11. Working with conceptual structures : contributions to ICCS 2000. 8th International Conference on Conceptual Structures: Logical, Linguistic, and Computational Issues. Darmstadt, August 14-18, 2000 (2000) 0.02
    
    Abstract
    The 8th International Conference on Conceptual Structures - Logical, Linguistic, and Computational Issues (ICCS 2000) brings together a wide range of researchers and practitioners working with conceptual structures. During the last few years, the ICCS conference series has considerably widened its scope on different kinds of conceptual structures, stimulating research across domain boundaries. We hope that this stimulation is further enhanced by ICCS 2000 joining the long tradition of conferences in Darmstadt with extensive, lively discussions. This volume consists of contributions presented at ICCS 2000, complementing the volume "Conceptual Structures: Logical, Linguistic, and Computational Issues" (B. Ganter, G.W. Mineau (Eds.), LNAI 1867, Springer, Berlin-Heidelberg 2000). It contains submissions reviewed by the program committee, and position papers. We wish to express our appreciation to all the authors of submitted papers, to the general chair, the program chair, the editorial board, the program committee, and to the additional reviewers for making ICCS 2000 a valuable contribution in the knowledge processing research field. Special thanks go to the local organizers for making the conference an enjoyable and inspiring event. We are grateful to Darmstadt University of Technology, the Ernst Schröder Center for Conceptual Knowledge Processing, the Center for Interdisciplinary Studies in Technology, the Deutsche Forschungsgemeinschaft, Land Hessen, and NaviCon GmbH for their generous support
    Content
    Concepts & Language: Knowledge organization by procedures of natural language processing. A case study using the method GABEK (J. Zelger, J. Gadner) - Computer aided narrative analysis using conceptual graphs (H. Schärfe, P. Øhrstrøm) - Pragmatic representation of argumentative text: a challenge for the conceptual graph approach (H. Irandoust, B. Moulin) - Conceptual graphs as a knowledge representation core in a complex language learning environment (G. Angelova, A. Nenkova, S. Boycheva, T. Nikolov) - Conceptual Modeling and Ontologies: Relationships and actions in conceptual categories (Ch. Landauer, K.L. Bellman) - Concept approximations for formal concept analysis (J. Saquer, J.S. Deogun) - Faceted information representation (U. Priß) - Simple concept graphs with universal quantifiers (J. Tappe) - A framework for comparing methods for using or reusing multiple ontologies in an application (J. van Zyl, D. Corbett) - Designing task/method knowledge-based systems with conceptual graphs (M. Leclère, F. Trichet, Ch. Choquet) - A logical ontology (J. Farkas, J. Sarbo) - Algorithms and Tools: Fast concept analysis (Ch. Lindig) - A framework for conceptual graph unification (D. Corbett) - Visual CP representation of knowledge (H.D. Pfeiffer, R.T. Hartley) - Maximal isojoin for representing software textual specifications and detecting semantic anomalies (Th. Charnois) - Troika: using grids, lattices and graphs in knowledge acquisition (H.S. Delugach, B.E. Lampkin) - Open world theorem prover for conceptual graphs (J.E. Heaton, P. Kocura) - NetCare: a practical conceptual graphs software tool (S. Polovina, D. Strang) - CGWorld - a web based workbench for conceptual graphs management and applications (P. Dobrev, K. Toutanova) - Position papers: The edition project: Peirce's existential graphs (R. Müller) - Mining association rules using formal concept analysis (N. Pasquier) - Contextual logic summary (R. Wille) - Information channels and conceptual scaling (K.E. Wolff) - Spatial concepts - a rule exploration (S. Rudolph) - The TEXT-TO-ONTO learning environment (A. Mädche, St. Staab) - Controlling the semantics of metadata on audio-visual documents using ontologies (Th. Dechilly, B. Bachimont) - Building the ontological foundations of a terminology from natural language to conceptual graphs with Ribosome, a knowledge extraction system (Ch. Jacquelinet, A. Burgun) - CharGer: some lessons learned and new directions (H.S. Delugach) - Knowledge management using conceptual graphs (W.K. Pun)
  12. Schneider, R.: Web 3.0 ante portas? : Integration von Social Web und Semantic Web (2008) 0.02
    
    Abstract
    The Internet as a medium is changing, and with it the conditions under which content is published and received. What opportunities do the two currently debated visions of its future, the Social Web and the Semantic Web, offer? To answer this question, the article examines the foundations of both models with respect to applications and technology, but it also highlights their shortcomings and the added value of combining them in a way appropriate to the medium. Using the grammatical online information system grammis as an example, it sketches a strategy for the integrative use of the respective strengths of the two approaches.
    Date
    22. 1.2011 10:38:28
    Source
    Kommunikation, Partizipation und Wirkungen im Social Web, Band 1. Hrsg.: A. Zerfaß u.a
    Theme
    Semantic Web
  13. Witschel, H.F.: Terminologie-Extraktion : Möglichkeiten der Kombination statistischer und musterbasierter Verfahren (2004) 0.02
    
    Abstract
    Searching for information in unstructured natural-language data is the subject of so-called text mining. This thesis examines one subfield of text mining: the extraction of domain-specific technical terms from specialist texts of the respective domain. Why terminology extraction at all? The answer is simple: the key to understanding many subject areas lies in knowing their terminology. Of course, knowing a mere list of a domain's technical terms is not enough to master the domain, but such a list is an important prerequisite for compiling specialist dictionaries (think, for example, of reference works such as the clinical dictionary "Pschyrembel"): before one can think about the precise definitions of individual terms, it must first be decided which terms the dictionary should contain. A specialist dictionary should contain exactly those terms of a domain that are, or were, the subject of research in that field. What could be more natural, then, than to look at the relevant specialist literature and extract the knowledge it contains in the form of technical terms? Beyond that, further applications of terminology extraction are conceivable, such as the automatic subject indexing of texts or the construction of so-called topic maps, which present the important terms of a topic and relate them to one another. The thesis therefore first clarifies what terminology actually is and, above all, develops various methods that exploit the characteristic properties of technical terms in order to find them. The methods are derived from the linguistic and 'statistical' characteristics of technical terms and are combined in a suitable way.
    LCSH
    Data processing
    RSWK
    Sachtext / Text Mining
    Subject
    Sachtext / Text Mining
    Data processing
  14. Thelwall, M.; Price, L.: Language evolution and the spread of ideas on the Web : a procedure for identifying emergent hybrid word family members (2006) 0.02
    
    Abstract
    Word usage is of interest to linguists for its own sake as well as to social scientists and others who seek to track the spread of ideas, for example, in public debates over political decisions. The historical evolution of language can be analyzed with the tools of corpus linguistics through evolving corpora and the Web. But word usage statistics can only be gathered for known words. In this article, techniques are described and tested for identifying new words from the Web, focusing on the case when the words are related to a topic and have a hybrid form with a common sequence of letters. The results highlight the need to employ a combination of search techniques and show the wide potential of hybrid word family investigations in linguistics and social science.
  15. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.: Improving language understanding by Generative Pre-Training 0.02
    
    Abstract
    Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
  16. Li, Q.; Chen, Y.P.; Myaeng, S.-H.; Jin, Y.; Kang, B.-Y.: Concept unification of terms in different languages via web mining for Information Retrieval (2009) 0.02
    
    Abstract
    For historical and cultural reasons, English phrases, especially proper nouns and new words, frequently appear in Web pages written primarily in East Asian languages such as Chinese, Korean, and Japanese. Although such English terms and their equivalences in these East Asian languages refer to the same concept, they are often erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which the problem arises in IR and proposes a novel technique to solve it. Our method first extracts English terms from native Web documents in an East Asian language, and then unifies the extracted terms and their equivalences in the native language as one index unit. For Cross-Language Information Retrieval (CLIR), one of the major hindrances to achieving retrieval performance at the level of Mono-Lingual Information Retrieval (MLIR) is the translation of terms in search queries which can not be found in a bilingual dictionary. The Web mining approach proposed in this paper for concept unification of terms in different languages can also be applied to solve this well-known challenge in CLIR. Experimental results based on NTCIR and KT-Set test collections show that the high translation precision of our approach greatly improves performance of both Mono-Lingual and Cross-Language Information Retrieval.
  17. Rozinajová, V.; Macko, P.: Using natural language to search linked data (2017) 0.02
    
    Abstract
    There are many endeavors aiming to offer users more effective ways of getting relevant information from the Web. One of them is the concept of Linked Data, which provides interconnected data sources. But querying these types of data is difficult not only for conventional web users but also for experts in the field. A more convenient way of formulating user queries would therefore be of great value. One direction is to allow the user to pose queries in natural language. To make this task easier we have proposed a method for translating a natural language query into a SPARQL query. It is based on sentence structure, utilizing dependencies between the words in the user's query. The dependencies are used to map the query onto the semantic web structure, which is then translated into a SPARQL query. According to our first experiments we are able to answer a significant group of user queries.
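    A rough sketch of the general idea described above: parse the user's question with a dependency parser and map the grammatical relations onto a SPARQL triple pattern. This is not the authors' pipeline, only an illustration of how dependencies can drive query construction; spaCy with the en_core_web_sm model is assumed, and the DBpedia-style prefixes and naive property/entity naming are simplifications.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def question_to_sparql(question: str) -> str:
        doc = nlp(question)
        # take the syntactic root as the property and its proper-noun dependents as the entity
        root = next(t for t in doc if t.dep_ == "ROOT")
        entities = [t.text for t in doc if t.dep_ in ("nsubj", "dobj", "pobj") and t.pos_ == "PROPN"]
        entity = "_".join(entities) if entities else "Unknown"
        return f"SELECT ?x WHERE {{ dbr:{entity} dbo:{root.lemma_.lower()} ?x . }}"

    print(question_to_sparql("Who directed Inception?"))
    # SELECT ?x WHERE { dbr:Inception dbo:direct ?x . }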
    Series
    Information Systems and Applications, incl. Internet/Web, and HCI; 10151
    Source
    Semantic keyword-based search on structured data sources: COST Action IC1302. Second International KEYSTONE Conference, IKC 2016, Cluj-Napoca, Romania, September 8-9, 2016, Revised Selected Papers. Eds.: A. Calì et al.
  18. Gill, A.J.; Hinrichs-Krapels, S.; Blanke, T.; Grant, J.; Hedges, M.; Tanner, S.: Insight workflow : systematically combining human and computational methods to explore textual data (2017) 0.02
    
    Abstract
    Analyzing large quantities of real-world textual data has the potential to provide new insights for researchers. However, such data present challenges for both human and computational methods, requiring a diverse range of specialist skills, often shared across a number of individuals. In this paper we use the analysis of a real-world data set as our case study, and use this exploration as a demonstration of our "insight workflow," which we present for use and adaptation by other researchers. The data we use are impact case study documents collected as part of the UK Research Excellence Framework (REF), consisting of 6,679 documents and 6.25 million words; the analysis was commissioned by the Higher Education Funding Council for England (published as report HEFCE 2015). In our exploration and analysis we used a variety of techniques, ranging from keyword in context and frequency information to more sophisticated methods (topic modeling), with these automated techniques providing an empirical point of entry for in-depth and intensive human analysis. We present the 60 topics to demonstrate the output of our methods, and illustrate how the variety of analysis techniques can be combined to provide insights. We note potential limitations and propose future work.
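    A minimal sketch of the computational entry point described above: simple term-frequency features feeding a topic model whose topics then guide closer human reading. The REF case-study corpus is not reproduced here; a few toy documents stand in, the topic count is arbitrary, and scikit-learn is assumed.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "clinical trial improved patient outcomes in hospitals",
        "new vaccine reduced infection rates in patients",
        "museum exhibition increased public engagement with local history",
        "heritage archive digitised for public access and education",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)                               # term-frequency matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [terms[j] for j in topic.argsort()[-5:][::-1]]              # five highest-weight terms
        print(f"topic {i}: {', '.join(top)}")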
    Theme
    Data Mining
  19. Heyer, G.; Läuter, M.; Quasthoff, U.; Wolff, C.: Texttechnologische Anwendungen am Beispiel Text Mining (2000) 0.02
    
    Theme
    Data Mining
  20. Chowdhury, G.G.: Natural language processing (2002) 0.01
    
    Abstract
    Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things. NLP researchers aim to gather knowledge on how human beings understand and use language so that appropriate tools and techniques can be developed to make computer systems understand and manipulate natural languages to perform desired tasks. The foundations of NLP lie in a number of disciplines, namely, computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, and psychology. Applications of NLP include a number of fields of study, such as machine translation, natural language text processing and summarization, user interfaces, multilingual and cross-language information retrieval (CLIR), speech recognition, artificial intelligence, and expert systems. One important application area that is relatively new and has not been covered in previous ARIST chapters on NLP relates to the proliferation of the World Wide Web and digital libraries.
