  • Paper: Show and Tell: A Neural Image Caption Generator - Reading Notes

    Show and Tell: A Neural Image Caption Generator - Reading Notes

    Notes should not simply copy the paper's content; they must contain one's own thinking and understanding.

    I. Basic Information

    | Title | Author | Affiliation | Venue |
    | --- | --- | --- | --- |
    | Show and Tell: A Neural Image Caption Generator | Oriol Vinyals (vinyals@google.com) | Google DeepMind | CVPR 2015 |
    |  | Alexander Toshev (toshev@google.com) | Google |  |
    |  | Samy Bengio (bengio@google.com) | Google Brain |  |
    |  | Dumitru Erhan (dumitru@google.com) | Google Brain |  |

    II. Purpose of Reading This Paper

    To learn about relatively early neural-network-based research on image captioning.

    III. Scene and Problem

    Scene: computer vision and natural language processing; natural images.

    Problem: automatically describing the content of an image using properly formed English sentences.

    IV. Research Goal

    Present a single joint model that is more accurate, both qualitatively and quantitatively: it takes an image \(I\) as input and is trained to maximize the likelihood \(p(S|I)\) of producing a target sequence of words \(S = \{S_1, S_2, \ldots\}\), where each word \(S_t\) comes from a given dictionary, that describes the image adequately.

    V. Main Idea / Innovation

    Main inspiration:

    Recent advances in machine translation: transform a sentence \(S\) written in a source language into its translation \(T\) in the target language by maximizing \(p(T|S)\). An "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN that generates the target sentence.

    Main innovation:

    Replace the encoder RNN with a deep CNN (using a CNN as an image "encoder"). CNNs can produce a rich representation of the input image by embedding it into a fixed-length vector. The CNN is pre-trained for an image classification task, and its last hidden layer is used as input to the RNN decoder that generates sentences.
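
    As a minimal sketch of that encoder idea (an illustration, not the paper's exact setup): take an ImageNet-pretrained classifier, drop its classification head, and the forward pass yields a fixed-length image vector. The torchvision GoogLeNet here is an assumption.

```python
import torch
import torchvision.models as models

# Stand-in for the image "encoder": a pretrained classifier whose
# classification head is removed, so it emits a fixed-length vector.
cnn = models.googlenet(weights="DEFAULT")
cnn.fc = torch.nn.Identity()         # drop the final classification layer
cnn.eval()                           # inference mode: forward returns a plain tensor

image = torch.randn(1, 3, 224, 224)  # dummy image tensor
with torch.no_grad():
    feature = cnn(image)             # fixed-length representation, 1024-d for GoogLeNet
```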

    Training uses stochastic gradient descent.

    Model name:

    An end-to-end system: Neural Image Caption (NIC).

    VI. Core Algorithm

    1. Directly maximize the probability of the correct description given the image, using the formulation:

    \[ \theta^* = \arg\max_{\theta} \sum_{(I,S)} \log p(S|I;\theta) \]

    \(\theta\) are the parameters of the model, \(I\) is an image, and \(S\) is its correct transcription.

    With \(N\) the length of the sentence, the log probability factorizes by the chain rule: \(\log p(S|I) = \sum_{t=0}^{N} \log p(S_t|I, S_0, \ldots, S_{t-1})\).
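
    As a minimal illustration of this chain-rule sum (plain Python; the per-step distributions below are made-up stand-ins for model outputs, not values from the paper):

```python
import math

# Hypothetical per-step distributions p(S_t | I, S_0, ..., S_{t-1}).
word_probs = [
    {"a": 0.6, "the": 0.3},      # step t = 1
    {"dog": 0.5, "cat": 0.2},    # step t = 2
    {"runs": 0.4, "sits": 0.3},  # step t = 3
]
sentence = ["a", "dog", "runs"]

# log p(S|I) = sum_t log p(S_t | I, S_0, ..., S_{t-1})
log_p = sum(math.log(p[w]) for p, w in zip(word_probs, sentence))
print(log_p)  # log(0.6) + log(0.5) + log(0.4)
```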

    2. Model \(p(S_t|I, S_0, \ldots, S_{t-1})\) with an RNN, where the variable number of words conditioned upon, up to \(t-1\), is expressed by a fixed-length hidden state or memory \(h_t\). The state is updated after seeing a new input \(x_t\) by a non-linear function \(f\): \(h_{t+1} = f(h_t, x_t)\).

    3. The CNN uses a novel approach to batch normalization and yields the current best performance on the ILSVRC 2014 classification competition; \(f\) is a Long Short-Term Memory (LSTM) network.
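
    A minimal sketch of the recurrence \(h_{t+1} = f(h_t, x_t)\) with \(f\) as an LSTM, using PyTorch's nn.LSTMCell. The 512 dimensions match the paper, but the wiring is illustrative, not the original implementation.

```python
import torch
import torch.nn as nn

input_size = hidden_size = 512        # 512-dim, as reported in the paper
cell = nn.LSTMCell(input_size, hidden_size)

h = torch.zeros(1, hidden_size)       # hidden state h_t
c = torch.zeros(1, hidden_size)       # LSTM memory cell
for t in range(3):                    # a few dummy time steps
    x_t = torch.randn(1, input_size)  # stand-in for the embedded input at step t
    h, c = cell(x_t, (h, c))          # h_{t+1} = f(h_t, x_t)
```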

    4. Implementation details of the LSTM model follow in the next section.

    VII. Model Implementation Details

    The unrolling procedure:

    \[
    x_{-1} = \mathrm{CNN}(I) \\
    x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\} \\
    p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\}
    \]

    1. Each word is represented as a one-hot vector \(S_t\) of dimension equal to the size of the dictionary.

    2. \(S_0\) is a special start word and \(S_N\) a special stop word, designating the start and end of the sentence.

    3. Both the image and the words are mapped to the same space: the image by the vision CNN, the words by the word embedding \(W_e\).

    4. The image is only input once, at \(t = -1\), to inform the LSTM about the image contents.

    5. The loss is the sum of the negative log likelihoods of the correct word at each step: \(L(I,S) = -\sum_{t=1}^{N} \log p_t(S_t)\).
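
    Putting the unrolling equations and this loss together, here is a minimal PyTorch sketch of one training step. Every concrete choice (the toy linear "CNN", the vocabulary size, the 0/1 start/stop indices) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

vocab_size, dim = 10000, 512                  # assumed dictionary size; 512-dim as in the paper

cnn = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, dim))  # toy stand-in for the vision CNN
embed = nn.Embedding(vocab_size, dim)                             # word embedding W_e
lstm = nn.LSTMCell(dim, dim)
to_vocab = nn.Linear(dim, vocab_size)                             # logits over the dictionary
nll = nn.CrossEntropyLoss()                                       # -log p_t(S_t) at each step

image = torch.randn(1, 3, 224, 224)
words = torch.tensor([[0, 42, 7, 1]])         # [start, w, w, stop]; 0/1 as start/stop are assumptions

h = torch.zeros(1, dim)
c = torch.zeros(1, dim)
h, c = lstm(cnn(image), (h, c))               # x_{-1} = CNN(I), fed to the LSTM only once

loss = torch.zeros(())
for t in range(words.size(1) - 1):            # t in {0, ..., N-1}
    x_t = embed(words[:, t])                  # x_t = W_e S_t
    h, c = lstm(x_t, (h, c))
    logits = to_vocab(h)                      # defines p_{t+1}
    loss = loss + nll(logits, words[:, t + 1])  # L(I,S) = -sum_t log p_t(S_t)
loss.backward()                               # gradients for stochastic gradient descent
```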

    6. Two sentence generation approaches:

    The first, Sampling: sample the first word according to \(p_1\), then provide the corresponding embedding as input and sample \(p_2\), continuing until the special end-of-sentence token is sampled or some maximum length is reached.

    The second, BeamSearch: iteratively consider the set of the \(k\) best sentences up to time \(t\) as candidates to generate sentences of length \(t+1\), and keep only the best \(k\) of the results. This better approximates \(S = \arg\max_{S'} p(S'|I)\).

    The paper used the BeamSearch approach in its experiments with a beam size of 20; it was verified that a beam size of 1 degraded the results by 2 BLEU points on average.
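
    A minimal beam-search sketch: the step function is a hypothetical stand-in for the LSTM decoder, and summing log-probabilities keeps the product of word probabilities numerically stable.

```python
import heapq
import math

def beam_search(step, start, stop, k=20, max_len=20):
    """step(sentence) -> list of (next_word, prob); k=20 matches the paper's beam size."""
    beams = [(0.0, [start])]                      # (cumulative log-prob, sentence so far)
    for _ in range(max_len):
        candidates = []
        for score, sent in beams:
            if sent[-1] == stop:                  # finished sentences carry over unchanged
                candidates.append((score, sent))
                continue
            for word, p in step(sent):
                candidates.append((score + math.log(p), sent + [word]))
        beams = heapq.nlargest(k, candidates, key=lambda b: b[0])
    return max(beams, key=lambda b: b[0])[1]      # approximates argmax_{S'} p(S'|I)
```

    With \(k = 1\) this degenerates to greedy sampling of the most likely word at each step, which is consistent with the observation above that a beam size of 1 costs about 2 BLEU points.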

    VIII. Evaluation Metrics & Datasets

    1. Amazon Mechanical Turk: the most reliable metric. Raters are asked to give a subjective score on the usefulness of each description given the image; each image was rated by 2 workers, and in case of disagreement the scores are simply averaged.

    2. Automatically computed metrics: BLEU score, METEOR, CIDEr.
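
    For example, a sentence-level BLEU score can be computed with NLTK (a common tool choice for illustration, not necessarily what the authors used; the captions are made up):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "dog", "runs", "on", "the", "grass"]]  # human reference caption(s)
candidate = ["a", "dog", "runs", "on", "the", "field"]    # generated caption
print(sentence_bleu(reference, candidate))  # geometric mean of 1- to 4-gram precisions
```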

    3. Datasets: PASCAL VOC 2008, Flickr8k, Flickr30k, MSCOCO, and SBU (the datasets used in the transfer-learning experiments below).

    IX. Experimental Details

    Training Details:

    1. Overfitting: even high-quality datasets have fewer than 100,000 images, so overfitting is a concern. Mitigations:

    --Initialize the weights of the CNN component of the system from a pretrained model.

    --Initializing \(W_e\), the word embeddings, from a large news corpus was tried, but no significant gains were observed, so they were simply left uninitialized.

    --Model-level overfitting-avoidance techniques: dropout, ensembling models, and exploring the size of the model by trading off the number of hidden units versus depth (dropout and ensembling gave a few BLEU points of improvement).

    --Use 512 dimensions for both the embeddings and the LSTM memory.
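
    A sketch of how these choices might be wired up in PyTorch. The pretrained GoogLeNet, the dropout rate, and the learning rate are all assumptions; only the 512-dim sizes, the pretrained-CNN initialization, the uninitialized embeddings, and plain SGD come from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

dim, vocab_size = 512, 10000                   # 512-dim embeddings and LSTM memory, per the paper

cnn = models.googlenet(weights="DEFAULT")      # CNN component initialized from a pretrained model
embed = nn.Embedding(vocab_size, dim)          # W_e left randomly initialized (pretraining gave no gain)
lstm = nn.LSTM(dim, dim)
dropout = nn.Dropout(p=0.5)                    # dropout against overfitting; p=0.5 is an assumption

params = list(embed.parameters()) + list(lstm.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)   # stochastic gradient descent; lr is an assumption
```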

    X. Questions Investigated & Results

    Question 1:

    Whether the model could transfer to a different dataset, and how much the domain mismatch would be compensated for by, e.g., higher-quality labels or more training data.

    Transfer learning -- data size:

    1. Flickr30k and Flickr8k (Flickr30k has about 4 times more training data than Flickr8k): the Flickr30k-trained model scored 4 BLEU points better.

    2. MSCOCO and Flickr30k (MSCOCO has 5 times more training data than Flickr30k): with larger vocabulary differences and a bigger domain mismatch, all the BLEU scores degrade by 10 points.

    3. PASCAL: transfer learning from Flickr30k yielded worse results, with BLEU-1 at 53 (cf. 59).

    4. SBU (whose labels are captions rather than human-generated descriptions): the task is much harder, with a much larger and noisier vocabulary; transferring the MSCOCO model to SBU degrades performance from 28 down to 16.

    Question 2:

    Whether the model generates novel captions, and whether the generated captions are both diverse and high quality.

    1. The agreement in BLEU score among the top 15 generated sentences is 58.

    2. 80% of the best candidate sentences are present in the training set, because of the small amount of training data.

    3. About half of the time, a completely novel description appears when analyzing the top 15 generated sentences.

    Question 3:

    How the learned representations capture some semantics from the statistics of the language.

    Hypothesis: even with only a few examples of a class (e.g., "unicorn"), its proximity to other word embeddings (e.g., "horse") should provide a lot more information, which would be completely lost with more traditional bag-of-words-based approaches.

  • Original post: https://www.cnblogs.com/phoenixash/p/12335193.html