Search (2 results, page 1 of 1)
-
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.: Language models are unsupervised multitask learners
0.00
0.0028703054 = product of:
  0.005740611 = sum of:
    0.005740611 = product of:
      0.011481222 = sum of:
        0.011481222 = weight(_text_:a in 871) [ClassicSimilarity], result of:
          0.011481222 = score(doc=871,freq=16.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.2161963 = fieldWeight in 871, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=871)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
- Abstract
- Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
- Type
- a
-
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D.M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; Amodei, D.: Language models are few-shot learners (2020)
0.00
0.0023435948 = product of:
  0.0046871896 = sum of:
    0.0046871896 = product of:
      0.009374379 = sum of:
        0.009374379 = weight(_text_:a in 872) [ClassicSimilarity], result of:
          0.009374379 = score(doc=872,freq=24.0), product of:
            0.053105544 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046056706 = queryNorm
            0.17652355 = fieldWeight in 872, product of:
              4.8989797 = tf(freq=24.0), with freq of:
                24.0 = termFreq=24.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.03125 = fieldNorm(doc=872)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
- Abstract
- Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
- Type
- a
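Note on the relevance scores: the score explanation printed with each result is labeled ClassicSimilarity, which is a TF-IDF scheme. The sketch below re-derives both final scores from the numbers shown in those explanations. It is an illustration, not the catalog's own code: the function names are hypothetical, queryNorm and fieldNorm are copied from the explanations rather than computed, and the tf and idf formulas are assumed to be tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the reported values.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed idf formula; it reproduces the 1.153047 shown in the explanations.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_similarity_score(freq, doc_freq, max_docs, query_norm, field_norm, coords=()):
    # Hypothetical helper mirroring the structure of the explanation tree.
    tf = math.sqrt(freq)                    # tf(freq=...)
    idf = classic_idf(doc_freq, max_docs)   # idf(docFreq=..., maxDocs=...)
    query_weight = idf * query_norm         # queryWeight
    field_weight = tf * idf * field_norm    # fieldWeight
    score = query_weight * field_weight     # score(doc=..., freq=...)
    for coord in coords:                    # each coord(1/2) factor
        score *= coord
    return score


# Inputs copied from the two explanations (term "a", docs 871 and 872).
score_871 = classic_similarity_score(16.0, 37942, 44218, 0.046056706, 0.046875, (0.5, 0.5))
score_872 = classic_similarity_score(24.0, 37942, 44218, 0.046056706, 0.03125, (0.5, 0.5))

print(f"doc 871: {score_871:.10f}")  # explanation reports 0.0028703054
print(f"doc 872: {score_872:.10f}")  # explanation reports 0.0023435948
```

Both documents share the same queryWeight, so the ranking difference comes from fieldWeight: doc 872 has the higher term frequency (24 against 16) but a smaller fieldNorm (0.03125 against 0.046875), which outweighs it. Small differences in the last digits are expected, since the explanations appear to use single-precision arithmetic.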
Authors
- Amodei, D. 2
- Child, R. 2
- Sutskever, I. 2
- Agarwal, S. 1
- Askell, A. 1
- Berner, C. 1
- Brown, T.B. 1
- Chen, M. 1
- Chess, B. 1
- Clark, J. 1
- Dhariwal, P. 1
- Gray, S. 1
- Henighan, T. 1
- Herbert-Voss, A. 1
- Hesse, C. 1
- Kaplan, J. 1
- Krueger, G. 1
- Litwin, M. 1
- Luan, D. 1
- Mann, B. 1
- McCandlish, S. 1
- Neelakantan, A. 1
- Ramesh, A. 1
- Ryder, N. 1
- Sastry, G. 1
- Shyam, P. 1
- Sigler, E. 1
- Subbiah, M. 1
- Winter, C. 1
- Ziegler, D.M. 1