Search (1 result, page 1 of 1)

  • author_ss:"Erhan, D."
  • type_ss:"el"
  1. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D.: A picture is worth a thousand (coherent) words : building a natural description of images (2014)
    
    Content
    "People can summarize a complex scene in a few words without thinking twice. It's much more difficult for computers. But we've just gotten a bit closer -- we've developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images. Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what's going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language. Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it? This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German. Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN's last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN's rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.