Search (40 results, page 1 of 2)

  • × theme_ss:"Computerlinguistik"
  • × year_i:[2020 TO 2030}
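    The score after each title is the hit's Lucene relevance value. As a rough sketch, assuming Lucene's ClassicSimilarity (TF-IDF with query and field normalization, which this engine's scoring output reports), a single query term contributes tf * idf^2 * queryNorm * fieldNorm to a document's score; the constants below are the ones reported for the dominant term of the top hit.

      import math

      doc_freq, max_docs = 24, 44218
      idf = 1.0 + math.log(max_docs / (doc_freq + 1))   # ~8.478011
      query_norm = 0.05218836
      tf = math.sqrt(2.0)                               # term occurs twice: tf = sqrt(freq)
      field_norm = 0.046875

      query_weight = idf * query_norm                   # ~0.4424535
      field_weight = tf * idf * field_norm              # ~0.56201804
      print(query_weight * field_weight)                # ~0.24866685 = tf * idf^2 * queryNorm * fieldNorm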
  1. Noever, D.; Ciolino, M.: The Turing deception (2022) 0.07
    Abstract
    This research revisits the classic Turing test and compares recent large language models such as ChatGPT for their abilities to reproduce human-level comprehension and compelling text generation. Two task challenges (summary and question answering) prompt ChatGPT to produce original content (98-99%) from a single text entry and sequential questions initially posed by Turing in 1950. We score the original and generated content against the OpenAI GPT-2 Output Detector from 2019, and establish multiple cases where the generated content proves original and undetectable (98%). The question of a machine fooling a human judge recedes in this work relative to the question of "how would one prove it?" The original contribution of the work presents a metric and a simple grammatical set for understanding the writing mechanics of chatbots in evaluating their readability and statistical clarity, engagement, delivery, overall quality, and plagiarism risks. While Turing's original prose scores at least 14% below the machine-generated output, whether an algorithm displays hints of Turing's true initial thoughts (the "Lovelace 2.0" test) remains unanswerable.
    Source
    https://arxiv.org/abs/2212.06721
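    A minimal sketch of the detection step described above, using the Hugging Face transformers pipeline; the model id (the public RoBERTa-based GPT-2 output detector) and its output labels are assumptions, not the paper's exact setup.

      from transformers import pipeline

      # Assumed checkpoint: OpenAI's RoBERTa-based GPT-2 output detector on the HF hub.
      detector = pipeline("text-classification",
                          model="openai-community/roberta-base-openai-detector")
      for text in ["I propose to consider the question, 'Can machines think?'",
                   "As an AI language model, I can summarize the passage as follows."]:
          print(detector(text))   # e.g. [{'label': 'Real', 'score': ...}]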
  2. Morris, V.: Automated language identification of bibliographic resources (2020) 0.04
    Abstract
    This article describes experiments in the use of machine learning techniques at the British Library to assign language codes to catalog records, in order to provide information about the language of content of the resources described. In the first phase of the project, language codes were assigned to 1.15 million records with 99.7% confidence. The automated language identification tools developed will be used to contribute to future enhancement of over 4 million legacy records.
    Date
    2. 3.2020 19:04:22
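    The British Library's pipeline is not public; as a hedged illustration of the same idea, the langdetect package can assign an ISO 639-1 code to a record's title and keep it only above a confidence threshold (echoing the 99.7% figure above).

      from langdetect import detect_langs

      record_title = "Automated language identification of bibliographic resources"
      best = detect_langs(record_title)[0]      # e.g. 'en' with probability ~0.999
      language_code = best.lang if best.prob >= 0.997 else None
      print(language_code)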
  3. Meng, K.; Ba, Z.; Ma, Y.; Li, G.: A network coupling approach to detecting hierarchical linkages between science and technology (2024) 0.03
    Abstract
    Detecting science-technology hierarchical linkages is beneficial for understanding deep interactions between science and technology (S&T). Previous studies have mainly focused on linear linkages between S&T but ignored their structural linkages. In this paper, we propose a network coupling approach to inspect hierarchical interactions of S&T by integrating their knowledge linkages and structural linkages. S&T knowledge networks are first enhanced with bidirectional encoder representation from transformers (BERT) knowledge alignment, and then their hierarchical structures are identified based on K-core decomposition. Hierarchical coupling preferences and strengths of the S&T networks over time are further calculated based on similarities of coupling nodes' degree distribution and similarities of coupling edges' weight distribution. Extensive experimental results indicate that our approach is feasible and robust in identifying the coupling hierarchy, with performance superior to other isomorphism and dissimilarity algorithms. Our research broadens the perspective of S&T linkage measurement by identifying patterns and paths in the interaction of hierarchical S&T knowledge.
    Source
    Journal of the Association for Information Science and Technology. 75(2024) no.2, S.167-187
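    A minimal sketch of the K-core step alone, using networkx on a toy graph standing in for an S&T knowledge network; the BERT-based alignment and the coupling measures are not reproduced here.

      import networkx as nx

      G = nx.karate_club_graph()            # stand-in for a knowledge network
      core = nx.core_number(G)              # node -> deepest k-core containing it
      k_max = max(core.values())
      innermost = nx.k_core(G, k=k_max)     # the innermost level of the hierarchy
      print(k_max, sorted(innermost.nodes()))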
  4. Tao, J.; Zhou, L.; Hickey, K.: Making sense of the black-boxes : toward interpretable text classification using deep learning models (2023) 0.03
    Abstract
    Text classification is a common task in data science. Despite the superior performance of deep learning based models in various text classification tasks, their black-box nature poses significant challenges for wide adoption. The knowledge-to-action framework emphasizes several principles concerning the application and use of knowledge, such as ease-of-use, customization, and feedback. With the guidance of the above principles and the properties of interpretable machine learning, we identify the design requirements for and propose an interpretable deep learning (IDeL) based framework for text classification models. IDeL comprises three main components: feature penetration, instance aggregation, and feature perturbation. We evaluate our implementation of the framework with two distinct case studies: fake news detection and social question categorization. The experimental results provide evidence for the efficacy of IDeL components in enhancing the interpretability of text classification models. Moreover, the findings are generalizable across binary and multi-label, multi-class classification problems. The proposed IDeL framework introduces a unique iField perspective for building trusted models in data science by improving the transparency of and access to advanced black-box models.
    Source
    Journal of the Association for Information Science and Technology. 74(2023) no.6, S.685-700
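    A minimal sketch of the feature-perturbation idea on a toy sklearn classifier, not the paper's IDeL implementation: a word's importance is estimated by how much the predicted probability drops when the word is deleted.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
      clf.fit(["shocking miracle cure exposed", "city council approves budget"], [1, 0])  # toy data

      def word_importances(text, target=1):
          words = text.split()
          base = clf.predict_proba([text])[0][target]
          return {w: base - clf.predict_proba([" ".join(words[:i] + words[i + 1:])])[0][target]
                  for i, w in enumerate(words)}

      print(word_importances("shocking budget miracle"))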
  5. Luo, L.; Ju, J.; Li, Y.-F.; Haffari, G.; Xiong, B.; Pan, S.: ChatRule: mining logical rules with large language models for knowledge graph reasoning (2023) 0.03
    Abstract
    Logical rules are essential for uncovering the logical connections between relations, which could improve the reasoning performance and provide interpretable results on knowledge graphs (KGs). Although there have been many efforts to mine meaningful logical rules over KGs, existing methods suffer from computationally intensive searches over the rule space and a lack of scalability for large-scale KGs. Besides, they often ignore the semantics of relations, which is crucial for uncovering logical connections. Recently, large language models (LLMs) have shown impressive performance in the field of natural language processing and various applications, owing to their emergent ability and generalizability. In this paper, we propose a novel framework, ChatRule, unleashing the power of large language models for mining logical rules over knowledge graphs. Specifically, the framework is initiated with an LLM-based rule generator, leveraging both the semantic and structural information of KGs to prompt LLMs to generate logical rules. To refine the generated rules, a rule ranking module estimates the rule quality by incorporating facts from existing KGs. Finally, a rule validator harnesses the reasoning ability of LLMs to validate the logical correctness of ranked rules through chain-of-thought reasoning. ChatRule is evaluated on four large-scale KGs, w.r.t. different rule quality metrics and downstream tasks, showing the effectiveness and scalability of our method.
    Date
    23.11.2023 19:07:22
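    A minimal sketch of the rule-ranking step only: a candidate rule body1(X,Z) ^ body2(Z,Y) => head(X,Y) is scored by its confidence over existing KG facts; the LLM-based rule generation and chain-of-thought validation are not reproduced.

      facts = {("alice", "mother_of", "bob"),
               ("bob", "father_of", "carol"),
               ("alice", "grandmother_of", "carol")}

      def confidence(body1, body2, head):
          groundings = hits = 0
          for x, r1, z in facts:
              if r1 != body1:
                  continue
              for z2, r2, y in facts:
                  if r2 == body2 and z2 == z:
                      groundings += 1
                      hits += (x, head, y) in facts
          return hits / groundings if groundings else 0.0

      print(confidence("mother_of", "father_of", "grandmother_of"))  # 1.0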
  6. Andrushchenko, M.; Sandberg, K.; Turunen, R.; Marjanen, J.; Hatavara, M.; Kurunmäki, J.; Nummenmaa, T.; Hyvärinen, M.; Teräs, K.; Peltonen, J.; Nummenmaa, J.: Using parsed and annotated corpora to analyze parliamentarians' talk in Finland (2022) 0.03
    Abstract
    We present a search system for grammatically analyzed corpora of Finnish parliamentary records and interviews with former parliamentarians, annotated with metadata of talk structure and involved parliamentarians, and discuss their use through carefully chosen digital humanities case studies. We first introduce the construction, contents, and principles of use of the corpora. Then we discuss the application of the search system and the corpora to study how politicians talk about power, how ideological terms are used in political speech, and how to identify narratives in the data. All case studies stem from questions in the humanities and the social sciences, but rely on the grammatically parsed corpora in both identifying and quantifying passages of interest. Finally, the paper discusses the role of natural language processing methods for questions in the (digital) humanities. It makes the claim that a digital humanities inquiry of parliamentary speech and interviews with politicians cannot rely only on computational humanities modeling, but needs to accommodate a range of perspectives, starting with simple searches and quantitative exploration and ending with modeling. Furthermore, the digital humanities need a more thorough discussion about how the utilization of tools from information science and technologies alters the research questions posed in the humanities.
    Source
    Journal of the Association for Information Science and Technology. 73(2022) no.2, S.288-302
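    A hedged illustration of the kind of query such a system supports, here with spaCy's English model rather than the Finnish parsers behind the corpora: find sentences in which "power" occurs as the direct object of a verb.

      import spacy

      nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
      doc = nlp("The committee wields power. They said power corrupts.")
      for tok in doc:
          if tok.lemma_ == "power" and tok.dep_ == "dobj":
              print(tok.head.lemma_, "->", tok.sent.text)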
  7. Lee, G.E.; Sun, A.: Understanding the stability of medical concept embeddings (2021) 0.02
    Abstract
    Frequency is one of the major factors for training quality word embeddings. Several studies have recently discussed the stability of word embeddings in the general domain and suggested factors influencing the stability. In this work, we conduct a detailed analysis of the stability of concept embeddings in the medical domain, particularly in relation to concept frequency. The analysis reveals the surprisingly high stability of low-frequency concepts: low-frequency (<100) concepts have the same high stability as high-frequency (>1,000) concepts. To develop a deeper understanding of this finding, we propose a new factor, the noisiness of context words, which influences the stability of medical concept embeddings regardless of high or low frequency. We evaluate the proposed factor by showing the linear correlation with the stability of medical concept embeddings. The correlations are clear and consistent with various groups of medical concepts. Based on the linear relations, we make suggestions on ways to adjust the noisiness of context words for the improvement of stability. Finally, we demonstrate that the linear relation of the proposed factor extends to the word embedding stability in the general domain.
    Source
    Journal of the Association for Information Science and Technology. 72(2021) no.3, S.346-356
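    A minimal sketch of one common stability measure, assuming stability is read as the overlap of a concept's top-k nearest neighbours across two embedding runs (random toy matrices here).

      import numpy as np

      def topk(emb, i, k=10):
          sims = emb @ emb[i] / (np.linalg.norm(emb, axis=1) * np.linalg.norm(emb[i]))
          return set(np.argsort(-sims)[1:k + 1])    # drop the concept itself

      rng = np.random.default_rng(0)
      run_a = rng.normal(size=(200, 50))            # two independent training runs
      run_b = rng.normal(size=(200, 50))
      print(len(topk(run_a, 0) & topk(run_b, 0)) / 10)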
  8. Jha, A.: Why GPT-4 isn't all it's cracked up to be (2023) 0.02
    Abstract
    "I still don't know what to think about GPT-4, the new large language model (LLM) from OpenAI. On the one hand it is a remarkable product that easily passes the Turing test. If you ask it questions, via the ChatGPT interface, GPT-4 can easily produce fluid sentences largely indistinguishable from those a person might write. But on the other hand, amid the exceptional levels of hype and anticipation, it's hard to know where GPT-4 and other LLMs truly fit in the larger project of making machines intelligent.
    They might appear intelligent, but LLMs are nothing of the sort. They don't understand the meanings of the words they are using, nor the concepts expressed within the sentences they create. When asked how to bring a cow back to life, earlier versions of ChatGPT, for example, which ran on a souped-up version of GPT-3, would confidently provide a list of instructions. So-called hallucinations like this happen because language models have no concept of what a "cow" is or that "death" is a non-reversible state of being. LLMs do not have minds that can think about objects in the world and how they relate to each other. All they "know" is how likely it is that some sets of words will follow other sets of words, having calculated those probabilities from their training data. To make sense of all this, I spoke with Gary Marcus, an emeritus professor of psychology and neural science at New York University, for "Babbage", our science and technology podcast. Last year, as the world was transfixed by the sudden appearance of ChatGPT, he made some fascinating predictions about GPT-4.
    He doesn't dismiss the potential of LLMs to become useful assistants in all sorts of ways (Google and Microsoft have already announced that they will be integrating LLMs into their search and office productivity software). But he talked me through some of his criticisms of the technology's apparent capabilities. At the heart of Dr Marcus's thoughtful critique is an attempt to put LLMs into proper context. Deep learning, the underlying technology that makes LLMs work, is only one piece of the puzzle in the quest for machine intelligence. To reach the level of artificial general intelligence (AGI) that many tech companies strive for, i.e. machines that can plan, reason and solve problems in the way human brains can, they will need to deploy a suite of other AI techniques. These include, for example, the kind of "symbolic AI" that was popular before artificial neural networks and deep learning became all the rage.
    People use symbols to think about the world: if I say the words "cat", "house" or "aeroplane", you know instantly what I mean. Symbols can also be used to describe the way things are behaving (running, falling, flying) or they can represent how things should behave in relation to each other (a "+" means add the numbers before and after). Symbolic AI is a way to embed this human knowledge and reasoning into computer systems. Though the idea has been around for decades, it fell by the wayside a few years ago as deep learning, buoyed by the sudden easy availability of lots of training data and cheap computing power, became more fashionable. In the near future at least, there's no doubt people will find LLMs useful. But whether they represent a critical step on the path towards AGI, or rather just an intriguing detour, remains to be seen."
  9. Suissa, O.; Elmalech, A.; Zhitomirsky-Geffet, M.: Text analysis using deep neural networks in digital humanities and information science (2022) 0.02
    Abstract
    Combining computational technologies and humanities is an ongoing effort aimed at making resources such as texts, images, audio, video, and other artifacts digitally available, searchable, and analyzable. In recent years, deep neural networks (DNNs) have come to dominate the field of automatic text analysis and natural language processing (NLP), in some cases presenting super-human performance. DNNs are the state-of-the-art machine learning algorithms solving many NLP tasks that are relevant for Digital Humanities (DH) research, such as spell checking, language detection, entity extraction, author detection, question answering, and other tasks. These supervised algorithms learn patterns from a large number of "right" and "wrong" examples and apply them to new examples. However, using DNNs for analyzing the text resources in DH research presents two main challenges: (un)availability of training data and a need for domain adaptation. This paper explores these challenges by analyzing multiple use-cases of DH studies in recent literature and their possible solutions, and lays out a practical decision model for DH experts for when and how to choose the appropriate deep learning approaches for their research. Moreover, in this paper, we aim to raise awareness of the benefits of utilizing deep learning models in the DH community.
    Source
    Journal of the Association for Information Science and Technology. 73(2022) no.2, S.268-287
  10. Xiang, R.; Chersoni, E.; Lu, Q.; Huang, C.-R.; Li, W.; Long, Y.: Lexical data augmentation for sentiment analysis (2021) 0.02
    Abstract
    Machine learning methods, especially deep learning models, have achieved impressive performance in various natural language processing tasks including sentiment analysis. However, deep learning models are more demanding for training data. Data augmentation techniques are widely used to address annotated data scarcity, which hinders the full potential of machine learning techniques, by generating new instances from modifications to existing data or from external knowledge bases. This paper presents our work using part-of-speech (POS) focused lexical substitution for data augmentation (PLSDA) to enhance the performance of machine learning algorithms in sentiment analysis. We exploit POS information to identify words to be replaced and investigate different augmentation strategies to find semantically related substitutions when generating new instances. The choice of POS tags as well as a variety of strategies such as semantic-based substitution methods and sampling methods are discussed in detail. Performance evaluation focuses on the comparison between PLSDA and two previous lexical substitution-based data augmentation methods, one of which is thesaurus-based and the other lexicon-manipulation based. Our approach is tested on five English sentiment analysis benchmarks: SST-2, MR, IMDB, Twitter, and AirRecord. Hyperparameters such as the candidate similarity threshold and the number of newly generated instances are optimized. Results show that six classifiers (SVM, LSTM, BiLSTM-AT, bidirectional encoder representations from transformers [BERT], XLNet, and RoBERTa) trained with PLSDA achieve an accuracy improvement of more than 0.6% compared with two previous lexical substitution methods, averaged over five benchmarks. Introducing POS constraints and well-designed augmentation strategies can improve the reliability of lexical data augmentation methods. Consequently, PLSDA significantly improves the performance of sentiment analysis algorithms.
    Source
    Journal of the Association for Information Science and Technology. 72(2021) no.11, S.1432-1447
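    A minimal sketch of POS-focused lexical substitution with NLTK and WordNet; the paper's similarity thresholds, sampling strategies, and benchmark setup are omitted.

      import nltk
      from nltk.corpus import wordnet as wn

      for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
          nltk.download(pkg, quiet=True)

      def augment(sentence):
          out = []
          for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
              if tag.startswith("JJ"):              # substitute adjectives only
                  lemmas = {l.name().replace("_", " ")
                            for s in wn.synsets(word, pos=wn.ADJ) for l in s.lemmas()}
                  lemmas.discard(word)
                  out.append(sorted(lemmas)[0] if lemmas else word)
              else:
                  out.append(word)
          return " ".join(out)

      print(augment("the movie was great and the plot was clever"))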
  11. Corbara, S.; Moreo, A.; Sebastiani, F.: Syllabic quantity patterns as rhythmic features for Latin authorship attribution (2023) 0.02
    Abstract
    It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on so-called syllabic quantity, that is, on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility of employing syllabic quantity as a basis for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets using support vector machines (SVMs), show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
    Source
    Journal of the Association for Information Science and Technology. 74(2023) no.1, S.128-141
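    A minimal sketch of the feature idea: encode each text as a string of long (L) and short (S) syllables and attribute authorship from character n-grams of that rhythm string. The scansions and author labels below are fabricated; real Latin syllabic quantity rules are far richer.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      rhythms = ["LSLLSLSSLL", "LSLSSLLSLS", "SSLLSSLLSL", "SSLSLLSSLS"]  # toy scansions
      authors = ["Cicero", "Cicero", "Sallustius", "Sallustius"]
      clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)), LinearSVC())
      clf.fit(rhythms, authors)
      print(clf.predict(["LSLLSLSLLS"]))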
  12. Lund, B.D.; Wang, T.; Mannuru, N.R.; Nie, B.; Shimray, S.; Wang, Z.: ChatGPT and a new academic reality : artificial intelligence-written research papers and the ethics of the large language models in scholarly publishing (2023) 0.02
    Abstract
    This article discusses OpenAI's ChatGPT, a generative pre-trained transformer, which uses natural language processing to fulfill text-based user requests (i.e., a "chatbot"). The history and principles behind ChatGPT and similar models are discussed. This technology is then discussed in relation to its potential impact on academia and scholarly research and publishing. ChatGPT is seen as a potential model for the automated preparation of essays and other types of scholarly manuscripts. Potential ethical issues that could arise with the emergence of large language models like GPT-3, the underlying technology behind ChatGPT, and its usage by academics and researchers, are discussed and situated within the context of broader advancements in artificial intelligence, machine learning, and natural language processing for research and scholarly publishing.
    Source
    Journal of the Association for Information Science and Technology. 74(2023) no.5, S.570-581
  13. Zhang, Y.; Zhang, C.; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction (2020) 0.02
    Abstract
    Millions of messages are produced on microblog platforms every day, leading to the pressing need for automatic identification of key points from the massive texts. To absorb salient content from the vast bulk of microblog posts, this article focuses on the task of microblog keyphrase extraction. In previous work, most efforts treat messages as independent documents and might suffer from the data sparsity problem exhibited in short and informal microblog posts. On the contrary, we propose to enrich contexts via exploiting conversations initialized by target posts and formed by their replies, which are generally centered around topics relevant to the target posts and therefore helpful for keyphrase identification. Concretely, we present a neural keyphrase extraction framework, which has 2 modules: a conversation context encoder and a keyphrase tagger. The conversation context encoder captures indicative representation from their conversation contexts and feeds the representation into the keyphrase tagger, and the keyphrase tagger extracts salient words from target posts. The 2 modules were trained jointly to optimize the conversation context encoding and keyphrase extraction processes. In the conversation context encoder, we leverage hierarchical structures to capture the word-level indicative representation and message-level indicative representation hierarchically. In both of the modules, we apply character-level representations, which enables the model to explore morphological features and deal with the out-of-vocabulary problem caused by the informal language style of microblog messages. Extensive comparison results on real-life data sets indicate that our model outperforms state-of-the-art models from previous studies.
    Source
    Journal of the Association for Information Science and Technology. 71(2020) no.5, S.553-567
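    A minimal PyTorch sketch of the joint character- and word-level representation feeding a tagger; the conversation-context encoder and training loop are omitted, and all sizes are placeholders rather than the paper's architecture.

      import torch
      import torch.nn as nn

      class CharWordTagger(nn.Module):
          def __init__(self, n_words=1000, n_chars=64, n_tags=2, wdim=50, cdim=25):
              super().__init__()
              self.wemb = nn.Embedding(n_words, wdim)
              self.cemb = nn.Embedding(n_chars, cdim)
              self.char_rnn = nn.LSTM(cdim, cdim, batch_first=True)
              self.out = nn.Linear(wdim + cdim, n_tags)

          def forward(self, word_ids, char_ids):
              w = self.wemb(word_ids)                          # (seq, wdim)
              _, (h, _) = self.char_rnn(self.cemb(char_ids))   # h: (1, seq, cdim)
              return self.out(torch.cat([w, h[0]], dim=-1))    # per-word tag logits

      tagger = CharWordTagger()
      print(tagger(torch.randint(0, 1000, (7,)), torch.randint(0, 64, (7, 12))).shape)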
  14. Soni, S.; Lerman, K.; Eisenstein, J.: Follow the leader : documents on the leading edge of semantic change get more citations (2021) 0.02
    Abstract
    Diachronic word embeddings (vector representations of words over time) offer remarkable insights into the evolution of language and provide a tool for quantifying sociocultural change from text documents. Prior work has used such embeddings to identify shifts in the meaning of individual words. However, simply knowing that a word has changed in meaning is insufficient to identify the instances of word usage that convey the historical meaning or the newer meaning. In this study, we link diachronic word embeddings to documents, by situating those documents as leaders or laggards with respect to ongoing semantic changes. Specifically, we propose a novel method to quantify the degree of semantic progressiveness in each word usage, and then show how these usages can be aggregated to obtain scores for each document. We analyze two large collections of documents, representing legal opinions and scientific articles. Documents that are scored as semantically progressive receive a larger number of citations, indicating that they are especially influential. Our work thus provides a new technique for identifying lexical semantic leaders and demonstrates a new link between progressive use of language and influence in a citation network.
    Source
    Journal of the Association for Information Science and Technology. 72(2021) no.4, S.478-492
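    A toy approximation of the idea, not the paper's estimator: project a usage's context vector onto the direction of a word's drift between an early and a late diachronic embedding, so larger scores mark leading-edge usage.

      import numpy as np

      def progressiveness(context_vec, early_vec, late_vec):
          drift = late_vec - early_vec                    # direction of semantic change
          return float(context_vec @ drift / np.linalg.norm(drift))

      rng = np.random.default_rng(1)
      early, late = rng.normal(size=50), rng.normal(size=50)
      print(progressiveness(rng.normal(size=50), early, late))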
  15. Escolano, C.; Costa-Jussà, M.R.; Fonollosa, J.A.: From bilingual to multilingual neural-based machine translation by incremental training (2021) 0.02
    Abstract
    A common intermediate language representation in neural machine translation can be used to extend bilingual systems by incremental training. We propose a new architecture based on introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we can train multiple encoders and decoders for each language, sharing among them a common intermediate representation. Translation results on the low-resource tasks (Turkish-English and Kazakh-English tasks) show a BLEU improvement of up to 2.8 points. However, results on a larger dataset (Russian-English and Kazakh-English) show BLEU losses of a similar amount. While our system provides improvements only for the low-resource tasks in terms of translation quality, our system is capable of quickly deploying new language pairs without the need to retrain the rest of the system, which may be a game changer in some situations. Specifically, what is most relevant regarding our architecture is that it is capable of: reducing the number of production systems, with respect to the number of languages, from quadratic to linear; incrementally adding a new language to the system without retraining the languages already there; and allowing for translations from the new language to all the others present in the system.
    Source
    Journal of the Association for Information Science and Technology. 72(2021) no.2, S.190-203
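    A minimal PyTorch sketch of the additional training objective: the usual translation loss plus a distance term that pulls the two languages' encoder outputs toward a shared intermediate representation. The choice of MSE and the weighting are assumptions, not the paper's exact loss.

      import torch
      import torch.nn.functional as F

      def joint_loss(dec_logits, gold_ids, enc_src, enc_tgt, lam=1.0):
          translation = F.cross_entropy(dec_logits, gold_ids)
          interlingual = F.mse_loss(enc_src, enc_tgt)    # shared-representation penalty
          return translation + lam * interlingual

      logits = torch.randn(8, 100)                       # (tokens, vocab)
      gold = torch.randint(0, 100, (8,))
      print(joint_loss(logits, gold, torch.randn(8, 16), torch.randn(8, 16)))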
  16. Azpiazu, I.M.; Soledad Pera, M.: Is cross-lingual readability assessment possible? (2020) 0.02
    Abstract
    Most research efforts related to automatic readability assessment focus on the design of strategies that apply to a specific language. These state-of-the-art strategies are highly dependent on linguistic features that best suit the language for which they were intended, constraining their adaptability and making it difficult to determine whether they would remain effective if they were applied to estimate the level of difficulty of texts in other languages. In this article, we present the results of a study designed to determine the feasibility of a cross-lingual readability assessment strategy. To do so, we first analyzed the most common features used for readability assessment and determined their influence on the readability prediction process of 6 different languages: English, Spanish, Basque, Italian, French, and Catalan. In addition, we developed a cross-lingual readability assessment strategy that serves as a means to empirically explore the potential advantages of employing a single strategy (and set of features) for readability assessment in different languages, including interlanguage prediction agreement and prediction accuracy improvement for low-resource languages.
    Source
    Journal of the Association for Information Science and Technology. 71(2020) no.6, S.644-656
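    A minimal sketch of the kind of language-agnostic surface features such a cross-lingual strategy can rely on; the study's actual feature set and models are in the paper.

      def readability_features(text):
          sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
          words = text.split()
          return {"avg_sentence_len": len(words) / max(len(sentences), 1),
                  "avg_word_len": sum(map(len, words)) / max(len(words), 1),
                  "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1)}

      print(readability_features("Esto es una frase corta. This sentence is somewhat longer."))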
  17. Laparra, E.; Binford-Walsh, A.; Emerson, K.; Miller, M.L.; López-Hoffman, L.; Currim, F.; Bethard, S.: Addressing structural hurdles for metadata extraction from environmental impact statements (2023) 0.02
    Abstract
    Natural language processing techniques can be used to analyze the linguistic content of a document to extract missing pieces of metadata. However, accurate metadata extraction may not depend solely on the linguistics, but also on structural problems such as extremely large documents, unordered multi-file documents, and inconsistency in manually labeled metadata. In this work, we start from two standard machine learning solutions to extract pieces of metadata from Environmental Impact Statements, environmental policy documents that are regularly produced under the US National Environmental Policy Act of 1969. We present a series of experiments where we evaluate how these standard approaches are affected by different issues derived from real-world data. We find that metadata extraction can be strongly influenced by nonlinguistic factors such as document length and volume ordering and that the standard machine learning solutions often do not scale well to long documents. We demonstrate how such solutions can be better adapted to these scenarios, and conclude with suggestions for other NLP practitioners cataloging large document collections.
    Source
    Journal of the Association for Information Science and Technology. 74(2023) no.9, S.1124-1139
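    A minimal sketch of one mitigation for extremely long documents: classify fixed-size chunks and aggregate the per-chunk predictions by majority vote. The predict function is a hypothetical single-chunk classifier, not one of the paper's models.

      from collections import Counter

      def extract_metadata(pages, window=5, predict=lambda chunk: "lead agency: <hypothetical>"):
          votes = Counter(predict(" ".join(pages[i:i + window]))
                          for i in range(0, len(pages), window))
          return votes.most_common(1)[0][0]

      print(extract_metadata([f"page {i} text" for i in range(23)]))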
  18. Der Student aus dem Computer (2023) 0.02
    Date
    27. 1.2023 16:22:55
  19. Bager, J.: Die Text-KI ChatGPT schreibt Fachtexte, Prosa, Gedichte und Programmcode (2023) 0.01
    Date
    29.12.2022 18:22:55
  20. Rieger, F.: Lügende Computer (2023) 0.01
    Date
    16. 3.2023 19:22:55

Languages

  • e 36
  • d 4

Types

  • a 32
  • el 17
  • p 7
  • x 1