Search (14 results, page 1 of 1)

  • language_ss:"e"
  • theme_ss:"Computerlinguistik"
  • year_i:[2020 TO 2030}
  1. Noever, D.; Ciolino, M.: The Turing deception (2022) 0.12
    0.12172589 = sum of:
      0.08709657 = product of:
        0.26128972 = sum of:
          0.26128972 = weight(_text_:3a in 862) [ClassicSimilarity], result of:
            0.26128972 = score(doc=862,freq=2.0), product of:
              0.4649134 = queryWeight, product of:
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.054837555 = queryNorm
              0.56201804 = fieldWeight in 862, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.046875 = fieldNorm(doc=862)
        0.33333334 = coord(1/3)
      0.03462931 = product of:
        0.06925862 = sum of:
          0.06925862 = weight(_text_:work in 862) [ClassicSimilarity], result of:
            0.06925862 = score(doc=862,freq=4.0), product of:
              0.20127523 = queryWeight, product of:
                3.6703904 = idf(docFreq=3060, maxDocs=44218)
                0.054837555 = queryNorm
              0.3440991 = fieldWeight in 862, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.6703904 = idf(docFreq=3060, maxDocs=44218)
                0.046875 = fieldNorm(doc=862)
        0.5 = coord(1/2)
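    The breakdown above is Lucene's ClassicSimilarity (TF-IDF) explanation: each term's score is queryWeight × fieldWeight, where queryWeight = idf × queryNorm and fieldWeight = sqrt(termFreq) × idf × fieldNorm, and each clause is scaled by its coordination factor before the clauses are summed. A minimal Python sketch that re-derives the displayed total from the values shown (all constants are copied from the explanation; only the arithmetic is added):
      from math import sqrt, isclose

      # Constants copied from the score explanation for doc 862.
      query_norm = 0.054837555
      field_norm = 0.046875

      # Clause 1: weight(_text_:3a), termFreq 2.0, idf 8.478011, coord(1/3)
      idf_3a = 8.478011
      clause1 = (idf_3a * query_norm) * (sqrt(2.0) * idf_3a * field_norm) * (1 / 3)

      # Clause 2: weight(_text_:work), termFreq 4.0, idf 3.6703904, coord(1/2)
      idf_work = 3.6703904
      clause2 = (idf_work * query_norm) * (sqrt(4.0) * idf_work * field_norm) * (1 / 2)

      total = clause1 + clause2
      assert isclose(total, 0.12172589, rel_tol=1e-5)
      print(total)  # ~0.12172589, the document score shown above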
    
    Abstract
    This research revisits the classic Turing test and compares recent large language models such as ChatGPT for their abilities to reproduce human-level comprehension and compelling text generation. Two task challenges - summary and question answering - prompt ChatGPT to produce original content (98-99%) from a single text entry and sequential questions initially posed by Turing in 1950. We score the original and generated content against the OpenAI GPT-2 Output Detector from 2019, and establish multiple cases where the generated content proves original and undetectable (98%). The question of a machine fooling a human judge recedes in this work relative to the question of "how would one prove it?" The original contribution of the work presents a metric and simple grammatical set for understanding the writing mechanics of chatbots in evaluating their readability and statistical clarity, engagement, delivery, overall quality, and plagiarism risks. While Turing's original prose scores at least 14% below the machine-generated output, whether an algorithm displays hints of Turing's true initial thoughts (the "Lovelace 2.0" test) remains unanswerable.
    Source
    https://arxiv.org/abs/2212.06721
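    The abstract describes scoring the original and generated text with OpenAI's 2019 GPT-2 Output Detector, a RoBERTa-based classifier published on the Hugging Face hub. A minimal sketch of that scoring step, assuming the transformers library and the public checkpoint roberta-base-openai-detector (the model id and its Real/Fake labels are assumptions about the detector used, not details stated in this record):
      from transformers import pipeline

      # Assumed checkpoint: the RoBERTa-based GPT-2 output detector released by OpenAI in 2019.
      detector = pipeline("text-classification", model="roberta-base-openai-detector")

      samples = {
          "human": "A paragraph of Turing's original 1950 prose.",
          "machine": "A paragraph generated by a chatbot from a single text prompt.",
      }
      for name, text in samples.items():
          result = detector(text)[0]   # e.g. {"label": "Real" or "Fake", "score": 0.97}
          print(name, result["label"], round(result["score"], 3))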
  2. Morris, V.: Automated language identification of bibliographic resources (2020) 0.01
    0.014859463 = product of:
      0.029718926 = sum of:
        0.029718926 = product of:
          0.059437852 = sum of:
            0.059437852 = weight(_text_:22 in 5749) [ClassicSimilarity], result of:
              0.059437852 = score(doc=5749,freq=2.0), product of:
                0.19203177 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.054837555 = queryNorm
                0.30952093 = fieldWeight in 5749, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=5749)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    2. 3.2020 19:04:22
  3. Pepper, S.; Arnaud, P.J.L.: Absolutely PHAB : toward a general model of associative relations (2020) 0.01
    0.01442888 = product of:
      0.02885776 = sum of:
        0.02885776 = product of:
          0.05771552 = sum of:
            0.05771552 = weight(_text_:work in 103) [ClassicSimilarity], result of:
              0.05771552 = score(doc=103,freq=4.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.28674924 = fieldWeight in 103, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=103)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    There have been many attempts at classifying the semantic modification relations (R) of N + N compounds but this work has not led to the acceptance of a definitive scheme, so that devising a reusable classification is a worthwhile aim. The scope of this undertaking is extended to other binominal lexemes, i.e. units that contain two thing-morphemes without explicitly stating R, like prepositional units, N + relational adjective units, etc. The 25-relation taxonomy of Bourque (2014) was tested against over 15,000 binominal lexemes from 106 languages and extended to a 29-relation scheme ("Bourque2") through the introduction of two new reversible relations. Bourque2 is then mapped onto Hatcher's (1960) four-relation scheme (extended by the addition of a fifth relation, similarity, as "Hatcher2"). This results in a two-tier system usable at different degrees of granularity. On account of its semantic proximity to compounding, metonymy is then taken into account, following Janda's (2011) suggestion that it plays a role in word formation; Peirsman and Geeraerts' (2006) inventory of 23 metonymic patterns is mapped onto Bourque2, confirming the identity of metonymic and binominal modification relations. Finally, Blank's (2003) and Koch's (2001) work on lexical semantics justifies the addition to the scheme of a third, superordinate level which comprises the three Aristotelean principles of similarity, contiguity and contrast.
  4. Soni, S.; Lerman, K.; Eisenstein, J.: Follow the leader : documents on the leading edge of semantic change get more citations (2021) 0.01
    0.01442888 = product of:
      0.02885776 = sum of:
        0.02885776 = product of:
          0.05771552 = sum of:
            0.05771552 = weight(_text_:work in 169) [ClassicSimilarity], result of:
              0.05771552 = score(doc=169,freq=4.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.28674924 = fieldWeight in 169, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=169)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Diachronic word embeddings - vector representations of words over time - offer remarkable insights into the evolution of language and provide a tool for quantifying sociocultural change from text documents. Prior work has used such embeddings to identify shifts in the meaning of individual words. However, simply knowing that a word has changed in meaning is insufficient to identify the instances of word usage that convey the historical meaning or the newer meaning. In this study, we link diachronic word embeddings to documents by situating those documents as leaders or laggards with respect to ongoing semantic changes. Specifically, we propose a novel method to quantify the degree of semantic progressiveness in each word usage, and then show how these usages can be aggregated to obtain scores for each document. We analyze two large collections of documents, representing legal opinions and scientific articles. Documents that are scored as semantically progressive receive a larger number of citations, indicating that they are especially influential. Our work thus provides a new technique for identifying lexical semantic leaders and demonstrates a new link between progressive use of language and influence in a citation network.
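    The abstract above describes scoring each word usage for semantic progressiveness against diachronic embeddings and aggregating the usage scores to the document level. A minimal sketch of one way such a score could be computed, assuming embeddings for an earlier and a later time slice are available (the cosine-difference scoring and the mean aggregation are illustrative assumptions, not the authors' exact method):
      import numpy as np

      def cosine(a, b):
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

      def usage_progressiveness(context_vec, emb_old, emb_new):
          # Positive when the usage context sits closer to the word's newer sense.
          return cosine(context_vec, emb_new) - cosine(context_vec, emb_old)

      def document_score(context_vecs, emb_old, emb_new):
          # Aggregate usage-level scores into one document-level score.
          return float(np.mean([usage_progressiveness(c, emb_old, emb_new) for c in context_vecs]))

      # Toy vectors standing in for real diachronic embeddings and usage contexts.
      rng = np.random.default_rng(0)
      emb_old, emb_new = rng.normal(size=50), rng.normal(size=50)
      contexts = rng.normal(size=(10, 50))
      print(document_score(contexts, emb_old, emb_new))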
  5. Park, J.S.; O'Brien, J.C.; Cai, C.J.; Ringel Morris, M.; Liang, P.; Bernstein, M.S.: Generative agents : interactive simulacra of human behavior (2023) 0.01
    0.01442888 = product of:
      0.02885776 = sum of:
        0.02885776 = product of:
          0.05771552 = sum of:
            0.05771552 = weight(_text_:work in 972) [ClassicSimilarity], result of:
              0.05771552 = score(doc=972,freq=4.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.28674924 = fieldWeight in 972, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=972)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
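    The architecture sketched in the abstract stores experiences as natural-language memories, scores them for retrieval, and periodically synthesizes higher-level reflections. A minimal sketch of the memory-stream part, assuming retrieval combines recency, importance, and query relevance (the decay constant, the equal weighting, and the field names are illustrative assumptions; reflection and planning are omitted):
      from dataclasses import dataclass, field
      from datetime import datetime

      @dataclass
      class Memory:
          text: str
          importance: float        # e.g. 1-10, rated by the language model
          created: datetime
          relevance: float = 0.0   # filled in per query, e.g. embedding similarity

      @dataclass
      class MemoryStream:
          memories: list = field(default_factory=list)
          decay: float = 0.995     # recency decay per hour (illustrative)

          def retrieve(self, now, k=5):
              def score(m):
                  hours = (now - m.created).total_seconds() / 3600
                  # Equal weighting of recency, importance and relevance (a simplification).
                  return self.decay ** hours + m.importance / 10 + m.relevance
              return sorted(self.memories, key=score, reverse=True)[:k]

      stream = MemoryStream()
      stream.memories.append(Memory("Ate breakfast at the cafe", 3, datetime(2023, 2, 13, 8, 0)))
      stream.memories.append(Memory("Decided to throw a Valentine's Day party", 9, datetime(2023, 2, 13, 9, 0)))
      print([m.text for m in stream.retrieve(now=datetime(2023, 2, 13, 12, 0), k=1)])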
  6. Al-Khatib, K.; Ghosa, T.; Hou, Y.; Waard, A. de; Freitag, D.: Argument mining for scholarly document processing : taking stock and looking ahead (2021) 0.01
    0.014283863 = product of:
      0.028567726 = sum of:
        0.028567726 = product of:
          0.05713545 = sum of:
            0.05713545 = weight(_text_:work in 568) [ClassicSimilarity], result of:
              0.05713545 = score(doc=568,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.28386727 = fieldWeight in 568, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=568)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Argument mining targets structures in natural language related to interpretation and persuasion. Most scholarly discourse involves interpreting experimental evidence and attempting to persuade other scientists to adopt the same conclusions, which could benefit from argument mining techniques. However, while various argument mining studies have addressed student essays and news articles, those that target scientific discourse are still scarce. This paper surveys existing work in argument mining of scholarly discourse, and provides an overview of current models, data, tasks, and applications. We identify a number of key challenges confronting argument mining in the scientific domain, and suggest some possible solutions and future directions.
  7. Zhang, Y.; Zhang, C.; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction (2020) 0.01
    0.010202759 = product of:
      0.020405518 = sum of:
        0.020405518 = product of:
          0.040811036 = sum of:
            0.040811036 = weight(_text_:work in 5816) [ClassicSimilarity], result of:
              0.040811036 = score(doc=5816,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.20276234 = fieldWeight in 5816, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5816)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Millions of messages are produced on microblog platforms every day, leading to the pressing need for automatic identification of key points from the massive texts. To absorb salient content from the vast bulk of microblog posts, this article focuses on the task of microblog keyphrase extraction. In previous work, most efforts treat messages as independent documents and might suffer from the data sparsity problem exhibited in short and informal microblog posts. On the contrary, we propose to enrich contexts via exploiting conversations initialized by target posts and formed by their replies, which are generally centered around relevant topics to the target posts and therefore helpful for keyphrase identification. Concretely, we present a neural keyphrase extraction framework, which has 2 modules: a conversation context encoder and a keyphrase tagger. The conversation context encoder captures indicative representation from their conversation contexts and feeds the representation into the keyphrase tagger, and the keyphrase tagger extracts salient words from target posts. The 2 modules were trained jointly to optimize the conversation context encoding and keyphrase extraction processes. In the conversation context encoder, we leverage hierarchical structures to capture the word-level indicative representation and message-level indicative representation hierarchically. In both of the modules, we apply character-level representations, which enables the model to explore morphological features and deal with the out-of-vocabulary problem caused by the informal language style of microblog messages. Extensive comparison results on real-life data sets indicate that our model outperforms state-of-the-art models from previous studies.
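    A minimal sketch of the two-module shape described in the abstract, a conversation context encoder feeding a keyphrase tagger over the target post; the GRU layers, dimensions, and BIO-style tag head are illustrative choices rather than the authors' exact architecture, and the character-level representations are omitted for brevity:
      import torch
      import torch.nn as nn

      class ContextEncoder(nn.Module):
          """Encodes the reply conversation into a single context vector."""
          def __init__(self, vocab_size, emb_dim=100, hidden=128):
              super().__init__()
              self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
              self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

          def forward(self, conv_tokens):                 # (batch, conv_len)
              _, h = self.gru(self.emb(conv_tokens))      # h: (2, batch, hidden)
              return torch.cat([h[0], h[1]], dim=-1)      # (batch, 2*hidden)

      class KeyphraseTagger(nn.Module):
          """Tags each target-post token (e.g. BIO) conditioned on the conversation context."""
          def __init__(self, vocab_size, emb_dim=100, hidden=128, n_tags=3):
              super().__init__()
              self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
              self.gru = nn.GRU(emb_dim + 2 * hidden, hidden, batch_first=True, bidirectional=True)
              self.out = nn.Linear(2 * hidden, n_tags)

          def forward(self, post_tokens, context_vec):    # (batch, post_len), (batch, 2*hidden)
              ctx = context_vec.unsqueeze(1).expand(-1, post_tokens.size(1), -1)
              x = torch.cat([self.emb(post_tokens), ctx], dim=-1)
              h, _ = self.gru(x)
              return self.out(h)                          # (batch, post_len, n_tags) tag scores
    Training the two modules jointly, as the abstract describes, then amounts to backpropagating the tagging loss through both networks.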
  8. Lee, G.E.; Sun, A.: Understanding the stability of medical concept embeddings (2021) 0.01
    0.010202759 = product of:
      0.020405518 = sum of:
        0.020405518 = product of:
          0.040811036 = sum of:
            0.040811036 = weight(_text_:work in 159) [ClassicSimilarity], result of:
              0.040811036 = score(doc=159,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.20276234 = fieldWeight in 159, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=159)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Frequency is one of the major factors for training quality word embeddings. Several studies have recently discussed the stability of word embeddings in the general domain and suggested factors influencing the stability. In this work, we conduct a detailed analysis on the stability of concept embeddings in the medical domain, particularly in relation to concept frequency. The analysis reveals the surprisingly high stability of low-frequency concepts: low-frequency (<100) concepts have the same high stability as high-frequency (>1,000) concepts. To develop a deeper understanding of this finding, we propose a new factor, the noisiness of context words, which influences the stability of medical concept embeddings regardless of high or low frequency. We evaluate the proposed factor by showing the linear correlation with the stability of medical concept embeddings. The correlations are clear and consistent with various groups of medical concepts. Based on the linear relations, we make suggestions on ways to adjust the noisiness of context words for the improvement of stability. Finally, we demonstrate that the linear relation of the proposed factor extends to word embedding stability in the general domain.
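    Stability of an embedding is commonly measured as the overlap between a concept's nearest neighbours across independently trained runs, and the abstract evaluates a noisiness factor through its linear correlation with that stability. A minimal sketch of both steps (the neighbour count, the toy data, and the noisiness stand-in are illustrative assumptions, not the paper's definitions):
      import numpy as np

      def nearest_neighbours(emb, idx, k=10):
          sims = emb @ emb[idx] / (np.linalg.norm(emb, axis=1) * np.linalg.norm(emb[idx]) + 1e-9)
          return set(np.argsort(-sims)[1:k + 1])            # skip the concept itself

      def stability(run1, run2, idx, k=10):
          # Fraction of nearest neighbours shared between two training runs.
          return len(nearest_neighbours(run1, idx, k) & nearest_neighbours(run2, idx, k)) / k

      rng = np.random.default_rng(1)
      run1 = rng.normal(size=(1000, 64))                    # toy "concept" embeddings, run 1
      run2 = run1 + rng.normal(scale=0.1, size=run1.shape)  # correlated second run
      scores = np.array([stability(run1, run2, i) for i in range(100)])
      noisiness = rng.normal(size=100)                      # stand-in for context-word noisiness
      print(np.corrcoef(scores, noisiness)[0, 1])           # the linear correlation being tested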
  9. Xiang, R.; Chersoni, E.; Lu, Q.; Huang, C.-R.; Li, W.; Long, Y.: Lexical data augmentation for sentiment analysis (2021) 0.01
    0.010202759 = product of:
      0.020405518 = sum of:
        0.020405518 = product of:
          0.040811036 = sum of:
            0.040811036 = weight(_text_:work in 392) [ClassicSimilarity], result of:
              0.040811036 = score(doc=392,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.20276234 = fieldWeight in 392, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=392)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Machine learning methods, especially deep learning models, have achieved impressive performance in various natural language processing tasks including sentiment analysis. However, deep learning models are more demanding for training data. Data augmentation techniques are widely used to generate new instances based on modifications to existing data or relying on external knowledge bases to address annotated data scarcity, which hinders the full potential of machine learning techniques. This paper presents our work using part-of-speech (POS) focused lexical substitution for data augmentation (PLSDA) to enhance the performance of machine learning algorithms in sentiment analysis. We exploit POS information to identify words to be replaced and investigate different augmentation strategies to find semantically related substitutions when generating new instances. The choice of POS tags as well as a variety of strategies such as semantic-based substitution methods and sampling methods are discussed in detail. Performance evaluation focuses on the comparison between PLSDA and two previous lexical substitution-based data augmentation methods, one of which is thesaurus-based, and the other is lexicon manipulation based. Our approach is tested on five English sentiment analysis benchmarks: SST-2, MR, IMDB, Twitter, and AirRecord. Hyperparameters such as the candidate similarity threshold and number of newly generated instances are optimized. Results show that six classifiers (SVM, LSTM, BiLSTM-AT, bidirectional encoder representations from transformers [BERT], XLNet, and RoBERTa) trained with PLSDA achieve an accuracy improvement of more than 0.6% compared with the two previous lexical substitution methods, averaged over the five benchmarks. Introducing a POS constraint and well-designed augmentation strategies can improve the reliability of lexical data augmentation methods. Consequently, PLSDA significantly improves the performance of sentiment analysis algorithms.
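    A minimal sketch of the POS-focused lexical substitution idea described above, assuming the sentence is already POS-tagged and a small similarity lexicon is available (the tagset, the lexicon contents, and the one-substitution-per-copy strategy are placeholders, not the PLSDA implementation):
      import random

      # Hypothetical similarity lexicon: word -> semantically related substitutes.
      LEXICON = {
          "good": ["great", "fine", "decent"],
          "movie": ["film", "picture"],
          "boring": ["dull", "tedious"],
      }
      TARGET_POS = {"ADJ", "NOUN"}   # only words with these POS tags are replaced

      def augment(tagged_sentence, n_new=2, seed=0):
          """Generate new training instances by replacing one eligible word per copy."""
          rng = random.Random(seed)
          candidates = [i for i, (w, pos) in enumerate(tagged_sentence)
                        if pos in TARGET_POS and w in LEXICON]
          out = []
          for _ in range(n_new):
              if not candidates:
                  break
              i = rng.choice(candidates)
              word, pos = tagged_sentence[i]
              new = list(tagged_sentence)
              new[i] = (rng.choice(LEXICON[word]), pos)
              out.append(" ".join(w for w, _ in new))
          return out

      sentence = [("the", "DET"), ("movie", "NOUN"), ("was", "VERB"), ("boring", "ADJ")]
      print(augment(sentence))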
  10. Laparra, E.; Binford-Walsh, A.; Emerson, K.; Miller, M.L.; López-Hoffman, L.; Currim, F.; Bethard, S.: Addressing structural hurdles for metadata extraction from environmental impact statements (2023) 0.01
    0.010202759 = product of:
      0.020405518 = sum of:
        0.020405518 = product of:
          0.040811036 = sum of:
            0.040811036 = weight(_text_:work in 1042) [ClassicSimilarity], result of:
              0.040811036 = score(doc=1042,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.20276234 = fieldWeight in 1042, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1042)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Natural language processing techniques can be used to analyze the linguistic content of a document to extract missing pieces of metadata. However, accurate metadata extraction may not depend solely on the linguistics, but also on structural problems such as extremely large documents, unordered multi-file documents, and inconsistency in manually labeled metadata. In this work, we start from two standard machine learning solutions to extract pieces of metadata from Environmental Impact Statements, environmental policy documents that are regularly produced under the US National Environmental Policy Act of 1969. We present a series of experiments where we evaluate how these standard approaches are affected by different issues derived from real-world data. We find that metadata extraction can be strongly influenced by nonlinguistic factors such as document length and volume ordering and that the standard machine learning solutions often do not scale well to long documents. We demonstrate how such solutions can be better adapted to these scenarios, and conclude with suggestions for other NLP practitioners cataloging large document collections.
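    The abstract observes that standard classifiers scale poorly to very long documents; one common adaptation is to split the document into chunks, predict the metadata value per chunk, and aggregate the chunk-level predictions. A minimal sketch of that adaptation with a placeholder classifier (the chunk size and majority-vote aggregation are illustrative assumptions, not the authors' method):
      from collections import Counter

      def chunks(text, max_words=400):
          words = text.split()
          return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

      def predict_metadata(document, classify, max_words=400):
          """Classify each chunk separately and majority-vote the document-level value."""
          votes = Counter(classify(c) for c in chunks(document, max_words))
          return votes.most_common(1)[0][0]

      # Placeholder classifier; a real system would call a trained model here.
      def toy_classifier(chunk_text):
          return "highway project" if "highway" in chunk_text else "other"

      doc = "a very long environmental impact statement describing a highway corridor " * 200
      print(predict_metadata(doc, toy_classifier))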
  11. Luo, L.; Ju, J.; Li, Y.-F.; Haffari, G.; Xiong, B.; Pan, S.: ChatRule: mining logical rules with large language models for knowledge graph reasoning (2023) 0.01
    0.009287165 = product of:
      0.01857433 = sum of:
        0.01857433 = product of:
          0.03714866 = sum of:
            0.03714866 = weight(_text_:22 in 1171) [ClassicSimilarity], result of:
              0.03714866 = score(doc=1171,freq=2.0), product of:
                0.19203177 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.054837555 = queryNorm
                0.19345059 = fieldWeight in 1171, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1171)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    23.11.2023 19:07:22
  12. Aydin, Ö.; Karaarslan, E.: OpenAI ChatGPT generated literature review : digital twin in healthcare (2022) 0.01
    0.008162207 = product of:
      0.016324414 = sum of:
        0.016324414 = product of:
          0.032648828 = sum of:
            0.032648828 = weight(_text_:work in 851) [ClassicSimilarity], result of:
              0.032648828 = score(doc=851,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.16220987 = fieldWeight in 851, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.03125 = fieldNorm(doc=851)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Literature review articles are essential to summarize the related work in the selected field. However, covering all related studies takes too much time and effort. This study questions how Artificial Intelligence can be used in this process. We used ChatGPT to create a literature review article to show the stage of the OpenAI ChatGPT artificial intelligence application. As the subject, the applications of Digital Twin in the health field were chosen. Abstracts of papers from the last three years (2020, 2021 and 2022) were obtained from the keyword "Digital twin in healthcare" search results on Google Scholar and paraphrased by ChatGPT. Later on, we asked ChatGPT questions. The results are promising; however, the paraphrased parts had significant matches when checked with the iThenticate tool. This article is the first attempt to show that the compilation and expression of knowledge will be accelerated with the help of artificial intelligence. We are still at the beginning of such advances. The future academic publishing process will require less human effort, which in turn will allow academics to focus on their studies. In future studies, we will monitor citations to this study to evaluate the academic validity of the content produced by ChatGPT.
    1. Introduction
    OpenAI ChatGPT (ChatGPT, 2022) is a chatbot based on the OpenAI GPT-3 language model. It is designed to generate human-like text responses to user input in a conversational context. OpenAI ChatGPT is trained on a large dataset of human conversations and can be used to create responses to a wide range of topics and prompts. The chatbot can be used for customer service, content creation, and language translation tasks, creating replies in multiple languages. OpenAI ChatGPT is available through the OpenAI API, which allows developers to access and integrate the chatbot into their applications and systems. OpenAI ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) language model developed by OpenAI. It is designed to generate human-like text, allowing it to engage in conversation with users naturally and intuitively. OpenAI ChatGPT is trained on a large dataset of human conversations, allowing it to understand and respond to a wide range of topics and contexts. It can be used in various applications, such as chatbots, customer service agents, and language translation systems. OpenAI ChatGPT is a state-of-the-art language model able to generate coherent and natural text that can be indistinguishable from text written by a human. As an artificial intelligence, ChatGPT may need help to change academic writing practices. However, it can provide information and guidance on ways to improve people's academic writing skills.
  13. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D.M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; Amodei, D.: Language models are few-shot learners (2020) 0.01
    0.008162207 = product of:
      0.016324414 = sum of:
        0.016324414 = product of:
          0.032648828 = sum of:
            0.032648828 = weight(_text_:work in 872) [ClassicSimilarity], result of:
              0.032648828 = score(doc=872,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.16220987 = fieldWeight in 872, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.03125 = fieldNorm(doc=872)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
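    As the abstract notes, tasks and few-shot demonstrations are "specified purely via text interaction with the model": the task description, a handful of worked examples, and the new query are simply concatenated into one prompt, with no gradient updates. A minimal sketch of that prompt construction (the English-French demonstrations echo the paper's running example; the completion call itself is omitted):
      def build_few_shot_prompt(task_description, demonstrations, query):
          """Concatenate instructions, K worked examples, and the new query into one prompt."""
          lines = [task_description, ""]
          for source, target in demonstrations:
              lines += [f"Input: {source}", f"Output: {target}", ""]
          lines += [f"Input: {query}", "Output:"]   # the model continues from here
          return "\n".join(lines)

      prompt = build_few_shot_prompt(
          "Translate English to French.",
          [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
          "cheese",
      )
      print(prompt)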
  14. Jha, A.: Why GPT-4 isn't all it's cracked up to be (2023) 0.01
    0.0071419314 = product of:
      0.014283863 = sum of:
        0.014283863 = product of:
          0.028567726 = sum of:
            0.028567726 = weight(_text_:work in 923) [ClassicSimilarity], result of:
              0.028567726 = score(doc=923,freq=2.0), product of:
                0.20127523 = queryWeight, product of:
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.054837555 = queryNorm
                0.14193363 = fieldWeight in 923, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.6703904 = idf(docFreq=3060, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=923)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    He doesn't dismiss the potential of LLMs to become useful assistants in all sorts of ways - Google and Microsoft have already announced that they will be integrating LLMs into their search and office productivity software. But he talked me through some of his criticisms of the technology's apparent capabilities. At the heart of Dr Marcus's thoughtful critique is an attempt to put LLMs into proper context. Deep learning, the underlying technology that makes LLMs work, is only one piece of the puzzle in the quest for machine intelligence. To reach the level of artificial general intelligence (AGI) that many tech companies strive for - i.e. machines that can plan, reason and solve problems in the way human brains can - they will need to deploy a suite of other AI techniques. These include, for example, the kind of "symbolic AI" that was popular before artificial neural networks and deep learning became all the rage.