Search (9 results, page 1 of 1)

Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L.: Explain images with multimodal recurrent neural networks (2014) 0.01
```
0.01102379 = product of:
  0.03307137 = sum of:
    0.03307137 = product of:
      0.0992141 = sum of:
        0.0992141 = weight(_text_:network in 1557) [ClassicSimilarity], result of:
          0.0992141 = score(doc=1557,freq=6.0), product of:
            0.19402927 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.043569047 = queryNorm
            0.51133573 = fieldWeight in 1557, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.046875 = fieldNorm(doc=1557)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```
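The explain tree above can be checked by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The function name `classic_similarity_score` and its parameters are illustrative and not part of the search system. `queryNorm`, `fieldNorm`, and the two `coord(1/3)` factors are copied from the tree rather than derived.

```python
import math


def classic_similarity_score(freq, doc_freq, max_docs, query_norm, field_norm,
                             coords=(1 / 3, 1 / 3)):
    """Recompute a single-term score from the values shown in the explain tree.

    Assumes Lucene's ClassicSimilarity definitions of tf and idf; query_norm,
    field_norm and the coord factors are taken as given from the explain output.
    """
    tf = math.sqrt(freq)                               # tf(freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))    # idf(docFreq, maxDocs)
    query_weight = idf * query_norm                    # queryWeight
    field_weight = tf * idf * field_norm               # fieldWeight
    score = query_weight * field_weight                # weight(_text_:network)
    for coord in coords:                               # the two coord(1/3) factors
        score *= coord
    return score


# Values taken from the explain tree for doc=1557 above.
score = classic_similarity_score(
    freq=6.0,
    doc_freq=1398,
    max_docs=44218,
    query_norm=0.043569047,
    field_norm=0.046875,
)
print(f"{score:.6f}")
```

The result should agree with the listed 0.01102379 up to rounding, since Lucene computes these values in single precision. Only `freq` and `fieldNorm` vary between the records that match on the same term, which is why the idf and queryWeight lines repeat unchanged across entries.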
Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12 [8], Flickr 8K [28], and Flickr 30K [13]). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
Karpathy, A.; Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions (2015) 0.01
```
0.0063645868 = product of:
  0.01909376 = sum of:
    0.01909376 = product of:
      0.057281278 = sum of:
        0.057281278 = weight(_text_:network in 1868) [ClassicSimilarity], result of:
          0.057281278 = score(doc=1868,freq=2.0), product of:
            0.19402927 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.043569047 = queryNorm
            0.29521978 = fieldWeight in 1868, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.046875 = fieldNorm(doc=1868)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```
Abstract

We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations.
Kiros, R.; Salakhutdinov, R.; Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models (2014) 0.01
```
0.0063645868 = product of:
  0.01909376 = sum of:
    0.01909376 = product of:
      0.057281278 = sum of:
        0.057281278 = weight(_text_:network in 1871) [ClassicSimilarity], result of:
          0.057281278 = score(doc=1871,freq=2.0), product of:
            0.19402927 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.043569047 = queryNorm
            0.29521978 = fieldWeight in 1871, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.046875 = fieldNorm(doc=1871)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```
Abstract

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description (2014) 0.01
```
0.0053038225 = product of:
  0.015911467 = sum of:
    0.015911467 = product of:
      0.047734402 = sum of:
        0.047734402 = weight(_text_:network in 1873) [ClassicSimilarity], result of:
          0.047734402 = score(doc=1873,freq=2.0), product of:
            0.19402927 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.043569047 = queryNorm
            0.2460165 = fieldWeight in 1873, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1873)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```
Abstract

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D.: A picture is worth a thousand (coherent) words : building a natural description of images (2014) 0.01
```
0.005250517 = product of:
  0.01575155 = sum of:
    0.01575155 = product of:
      0.047254648 = sum of:
        0.047254648 = weight(_text_:network in 1874) [ClassicSimilarity], result of:
          0.047254648 = score(doc=1874,freq=4.0), product of:
            0.19402927 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.043569047 = queryNorm
            0.24354391 = fieldWeight in 1874, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.02734375 = fieldNorm(doc=1874)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```
Content

"People can summarize a complex scene in a few words without thinking twice. It's much more difficult for computers. But we've just gotten a bit closer -- we've developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images. Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what's going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language. Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it? This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German. Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN's last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. 
But if we remove that final layer, we can instead feed the CNN's rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image."

Schöneberg, U.; Gödert, W.: Erschließung mathematischer Publikationen mittels linguistischer Verfahren (2012) 0.00

```
0.003971059 = product of:
  0.011913176 = sum of:
    0.011913176 = product of:
      0.035739526 = sum of:
        0.035739526 = weight(_text_:29 in 1055) [ClassicSimilarity], result of:
          0.035739526 = score(doc=1055,freq=2.0), product of:
            0.15326229 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.043569047 = queryNorm
            0.23319192 = fieldWeight in 1055, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.046875 = fieldNorm(doc=1055)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```

Date: 12. 9.2013 12:29:05

Banerjee, K.; Johnson, M.: Improving access to archival collections with automated entity extraction (2015) 0.00

```
0.003971059 = product of:
  0.011913176 = sum of:
    0.011913176 = product of:
      0.035739526 = sum of:
        0.035739526 = weight(_text_:29 in 2144) [ClassicSimilarity], result of:
          0.035739526 = score(doc=2144,freq=2.0), product of:
            0.15326229 = queryWeight, product of:
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.043569047 = queryNorm
            0.23319192 = fieldWeight in 2144, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5176873 = idf(docFreq=3565, maxDocs=44218)
              0.046875 = fieldNorm(doc=2144)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```

Source: Code4Lib journal. Issue 29(2015), [http://journal.code4lib.org/issues/issues/issue29]

Junger, U.; Schwens, U.: Die inhaltliche Erschließung des schriftlichen kulturellen Erbes auf dem Weg in die Zukunft : Automatische Vergabe von Schlagwörtern in der Deutschen Nationalbibliothek (2017) 0.00

```
0.003279447 = product of:
  0.009838341 = sum of:
    0.009838341 = product of:
      0.029515022 = sum of:
        0.029515022 = weight(_text_:22 in 3780) [ClassicSimilarity], result of:
          0.029515022 = score(doc=3780,freq=2.0), product of:
            0.15257138 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.043569047 = queryNorm
            0.19345059 = fieldWeight in 3780, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3780)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```

Date: 19. 8.2017 9:24:22

Markoff, J.: Researchers announce advance in image-recognition software (2014) 0.00
```
0.0026519112 = product of:
  0.007955734 = sum of:
    0.007955734 = product of:
      0.023867201 = sum of:
        0.023867201 = weight(_text_:network in 1875) [ClassicSimilarity], result of:
          0.023867201 = score(doc=1875,freq=2.0), product of:
            0.19402927 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.043569047 = queryNorm
            0.12300825 = fieldWeight in 1875, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.01953125 = fieldNorm(doc=1875)
      0.33333334 = coord(1/3)
  0.33333334 = coord(1/3)
```
Content

Computer vision specialists said that despite the improvements, these software systems had made only limited progress toward the goal of digitally duplicating human vision and, even more elusive, understanding. "I don't know that I would say this is 'understanding' in the sense we want," said John R. Smith, a senior manager at I.B.M.'s T.J. Watson Research Center in Yorktown Heights, N.Y. "I think even the ability to generate language here is very limited." But the Google and Stanford teams said that they expect to see significant increases in accuracy as they improve their software and train these programs with larger sets of annotated images. A research group led by Tamara L. Berg, a computer scientist at the University of North Carolina at Chapel Hill, is training a neural network with one million images annotated by humans. "You're trying to tell the story behind the image," she said. "A natural scene will be very complex, and you want to pick out the most important objects in the image."
