Deep Learning, NLP, and Representations
Posted on July 7, 2014
Introduction
In the last few years, deep neural networks have dominated pattern recognition. They blew the previous state of the art out of the water for many computer vision tasks. Voice recognition is also moving that way.
But despite the results, we have to wonder… why do they work so well?
This post reviews some extremely remarkable results in applying deep neural networks to natural language processing (NLP). In doing so, I hope to make accessible one promising answer as to why deep neural networks work. I think it’s a very elegant perspective.
One Hidden Layer Neural Networks
A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem.
It’s true, essentially, because the hidden layer can be used as a lookup table.
For simplicity, let’s consider a perceptron network. A perceptron is a very simple neuron that fires if it exceeds a certain threshold and doesn’t fire if it doesn’t reach that threshold. A perceptron network gets binary (0 and 1) inputs and gives binary outputs.
Note that there are only a finite number of possible inputs. For each possible input, we can construct a neuron in the hidden layer that fires for that input,[1] and only on that specific input. Then we can use the connections between that neuron and the output neurons to control the output in that specific case.[2]
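To make the lookup-table argument concrete, here is a minimal Python sketch (my own illustration, not a construction from the post; `detector_for` is a hypothetical helper) of a hidden perceptron unit that fires for exactly one binary input pattern:

```python
import numpy as np

def perceptron(x, w, b):
    # fires (outputs 1) iff the weighted sum crosses the threshold encoded in b
    return int(np.dot(w, x) + b > 0)

def detector_for(pattern):
    # a hidden unit that fires only on this exact binary pattern:
    # +1 weights where the pattern is 1, -1 where it is 0, and a
    # threshold that only a perfect match can cross
    w = np.where(np.array(pattern) == 1, 1.0, -1.0)
    b = 0.5 - float(sum(pattern))
    return lambda x: perceptron(np.array(x), w, b)

unit = detector_for([1, 0, 1])
print(unit([1, 0, 1]))  # 1 -- fires only for its own pattern
print(unit([1, 1, 1]))  # 0
print(unit([0, 0, 1]))  # 0
```

One such unit per possible input, plus output connections chosen case by case, gives the lookup table.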
flowchart-PerceptronLookup.png
And so, it’s true that one hidden layer neural networks are universal. But there isn’t anything particularly impressive or exciting about that. Saying that your model can do the same thing as a lookup table isn’t a very strong argument for it. It just means it isn’t impossible for your model to do the task.
Universality means that a network can fit to any training data you give it. It doesn’t mean that it will interpolate to new data points in a reasonable way.
No, universality isn’t an explanation for why neural networks work so well. The real reason seems to be something much more subtle… And, to understand it, we’ll first need to understand some concrete results.
Word Embeddings
I’d like to start by tracing a particularly interesting strand of deep learning research: word embeddings. In my personal opinion, word embeddings are one of the most exciting areas of research in deep learning at the moment, although they were originally introduced by Bengio, et al. more than a decade ago.[3] Beyond that, I think they are one of the best places to gain intuition about why deep learning is so effective.
A word embedding W: words → ℝⁿ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions). For example, we might find:
W("cat") = (0.2, -0.4, 0.7, ...)
W("mat") = (0.0, 0.6, -0.1, ...)
(Typically, the function is a lookup table, parameterized by a matrix, θ, with a row for each word: W_θ(w_n) = θ_n.)
W is initialized to have random vectors for each word. It learns to have meaningful vectors in order to perform some task.
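Concretely, a minimal sketch of such a lookup table (the toy vocabulary and the 300-dimension choice are illustrative assumptions):

```python
import numpy as np

vocab = {"cat": 0, "sat": 1, "on": 2, "the": 3, "mat": 4}  # toy vocabulary
theta = np.random.randn(len(vocab), 300)  # one row per word, random at init

def W(word):
    # the embedding is literally a table lookup: row n of theta for word n
    return theta[vocab[word]]

print(W("cat")[:3])  # random for now; training nudges these rows into place
```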
For example, one task we might train a network for is predicting whether a 5-gram (sequence of five words) is ‘valid.’ We can easily get lots of 5-grams from Wikipedia (eg. “cat sat on the mat”) and then ‘break’ half of them by switching a word with a random word (eg. “cat sat song the mat”), since that will almost certainly make our 5-gram nonsensical.
Bottou-WordSetup.png
Modular Network to determine if a 5-gram is ‘valid’ (From Bottou (2011))
The model we train will run each word in the 5-gram through W to get a vector representing it and feed those into another ‘module’ called R which tries to predict if the 5-gram is ‘valid’ or ‘broken.’ Then, we’d like:
R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0
In order to predict these values accurately, the network needs to learn good parameters for both W and R.
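As a rough sketch of how W and R might be wired together (the post doesn’t specify R’s internals; the layer sizes and the PyTorch framing here are my assumptions):

```python
import torch
import torch.nn as nn

class FiveGramScorer(nn.Module):
    # W: shared embedding lookup. R: reads the five vectors and scores validity.
    def __init__(self, vocab_size, dim=200):
        super().__init__()
        self.W = nn.Embedding(vocab_size, dim)
        self.R = nn.Sequential(
            nn.Linear(5 * dim, 100),
            nn.Tanh(),
            nn.Linear(100, 1),
            nn.Sigmoid(),  # probability that the 5-gram is 'valid'
        )

    def forward(self, word_ids):        # word_ids: (batch, 5) integer tensor
        vecs = self.W(word_ids)         # (batch, 5, dim)
        return self.R(vecs.flatten(1))  # (batch, 1)
```

Training would then minimize binary cross-entropy on a mix of real and ‘broken’ 5-grams, with gradients flowing into R’s weights and into the rows of W alike.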
Now, this task isn’t terribly interesting. Maybe it could be helpful in detecting grammatical errors in text or something. But what is extremely interesting is W.
(In fact, to us, the entire point of the task is to learn W. We could have done several other tasks – another common one is predicting the next word in the sentence. But we don’t really care. In the remainder of this section we will talk about many word embedding results and won’t distinguish between different approaches.)
One thing we can do to get a feel for the word embedding space is to visualize them with t-SNE, a sophisticated technique for visualizing high-dimensional data.
Turian-WordTSNE.png
t-SNE visualizations of word embeddings. Left: Number Region; Right: Jobs Region. From Turian et al. (2010), see complete image.
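A minimal sketch of producing such a visualization with scikit-learn’s t-SNE (the random `vectors` here stand in for real learned embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

# vectors: (n_words, dim) array of embeddings; words: a parallel list of strings
vectors = np.random.randn(1000, 200)  # stand-in for learned embeddings
coords = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
# coords[i] is a 2-D point for words[i]; scatter-plot and label the points
# to get a 'map' like the one above
```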
This kind of ‘map’ of words makes a lot of intuitive sense to us. Similar words are close together. Another way to get at this is to look at which words are closest in the embedding to a given word. Again, the words tend to be quite similar.
Colbert-WordTable2.png
What words have embeddings closest to a given word? From Collobert et al. (2011)
It seems natural for a network to make words with similar meanings have similar vectors. If you switch a word for a synonym (eg. “a few people sing well” → “a couple people sing well”), the validity of the sentence doesn’t change. While, from a naive perspective, the input sentence has changed a lot, if W maps synonyms (like “few” and “couple”) close together, from R’s perspective little changes.
This is very powerful. The number of possible 5-grams is massive and we have a comparatively small number of data points to try to learn from. Similar words being close together allows us to generalize from one sentence to a class of similar sentences. This doesn’t just mean switching a word for a synonym, but also switching a word for a word in a similar class (eg. “the wall is blue” → “the wall is red”). Further, we can change multiple words (eg. “the wall is blue” → “the ceiling is red”). The impact of this is exponential with respect to the number of words.[4]
So, clearly this is a very useful thing for W to do. But how does it learn to do this? It seems quite likely that there are lots of situations where it has seen a sentence like “the wall is blue” and knows that it is valid before it sees a sentence like “the wall is red”. As such, shifting “red” a bit closer to “blue” makes the network perform better.
We still need to see examples of every word being used, but the analogies allow us to generalize to new combinations of words. You’ve seen all the words that you understand before, but you haven’t seen all the sentences that you understand before. So too with neural networks.
Mikolov-GenderVecs.png
Word embeddings exhibit an even more remarkable property: analogies between words seem to be encoded in the difference vectors between words. For example, there seems to be a constant male-female difference vector:
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
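A small sketch of exploiting this property for analogy completion, assuming `W` is a dict from words to numpy vectors:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(W, a, b, c):
    # 'a is to b as c is to ?': find the word d maximizing
    # cosine(W[b] - W[a] + W[c], W[d])
    query = W[b] - W[a] + W[c]
    candidates = (w for w in W if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(query, W[w]))

# analogy(W, "man", "woman", "king") -> "queen",
# if the male-female difference vector really is constant
```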
This may not seem too surprising. After all, gender pronouns mean that switching a word can make a sentence grammatically incorrect. You write, “she is the aunt” but “he is the uncle.” Similarly, “he is the King” but “she is the Queen.” If one sees “she is the uncle,” the most likely explanation is a grammatical error. If words are being randomly switched half the time, it seems pretty likely that happened here.
“Of course!” We say with hindsight, “the word embedding will learn to encode gender in a consistent way. In fact, there’s probably a gender dimension. Same thing for singular vs plural. It’s easy to find these trivial relationships!”
It turns out, though, that much more sophisticated relationships are also encoded in this way. It seems almost miraculous!
Mikolov-AnalogyTable.png
Relationship pairs in a word embedding. From Mikolov et al. (2013b).
It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.
This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.
Shared Representations
The properties of word embeddings are certainly interesting, but can we do anything useful with them? Besides predicting silly things, like whether a 5-gram is ‘valid’?
flowchart-TranfserLearning2.png
W and F learn to perform task A. Later, G can learn to perform B based on W.
We learned the word embedding in order to do well on a simple task, but based on the nice properties we’ve observed in word embeddings, you may suspect that they could be generally useful in NLP tasks. In fact, word representations like these are extremely important:
The use of word representations… has become a key “secret sauce” for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling. (Luong et al. (2013))
This general tactic – learning a good representation on a task A and then using it on a task B – is one of the major tricks in the Deep Learning toolbox. It goes by different names depending on the details: pretraining, transfer learning, and multi-task learning. One of the great strengths of this approach is that it allows the representation to learn from more than one kind of data.
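A minimal sketch of this pretraining pattern, assuming a PyTorch embedding trained on task A (sizes and names are illustrative):

```python
import torch
import torch.nn as nn

dim, n_classes = 200, 5
W = nn.Embedding(50_000, dim)  # trained on task A (e.g. 5-gram validity)
# ... load W's weights from the task-A checkpoint ...

# G: a new head that learns task B on top of the pretrained representation
G = nn.Sequential(nn.Linear(dim, 100), nn.Tanh(), nn.Linear(100, n_classes))

for p in W.parameters():
    p.requires_grad = False  # freeze W; only G's parameters train on task B

optimizer = torch.optim.Adam(G.parameters())
```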
There’s a counterpart to this trick. Instead of learning a way to represent one kind of data and using it to perform multiple kinds of tasks, we can learn a way to map multiple kinds of data into a single representation!
One nice example of this is a bilingual word-embedding, produced in Socher et al. (2013a). We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.
flowchart-bilingual.png
We train two word embeddings, W_en and W_zh, in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.
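One way that additional property might be expressed as a loss term (a simplified assumption on my part, not necessarily Socher et al.’s exact objective):

```python
import torch
import torch.nn as nn

dim = 200
W_en = nn.Embedding(50_000, dim)  # English embedding
W_zh = nn.Embedding(50_000, dim)  # Chinese embedding

def alignment_loss(pairs):
    # pairs: (n, 2) integer tensor of (english_id, chinese_id) known
    # translations; pull each known pair together in the shared space
    en = W_en(pairs[:, 0])
    zh = W_zh(pairs[:, 1])
    return ((en - zh) ** 2).sum(dim=1).mean()
```

This term would be added to each language’s ordinary training objective, so the two embeddings are shaped by their own data and by the known translation pairs at once.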
Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.
In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.
Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.
Socher-BillingualTSNE.png
t-SNE visualization of the bilingual word embedding. Green is Chinese, Yellow is English. (Socher et al. (2013a))
In bilingual word embeddings, we learn a shared representation for two very similar kinds of data. But we can also learn to embed very different kinds of data in the same space.
flowchart-DeViSE.png
Recently, deep learning has begun exploring models that embed images and words in a single representation.[5]
The basic idea is that one classifies images by outputting a vector in a word embedding. Images of dogs are mapped near the “dog” word vector. Images of horses are mapped near the “horse” vector. Images of automobiles near the “automobile” vector. And so on.
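A DeViSE-style sketch of the idea (my simplification; the actual papers use more elaborate training objectives): project image features into the embedding space and label an image with the nearest word vector.

```python
import torch
import torch.nn as nn

class ImageToEmbedding(nn.Module):
    # project image features (e.g. from a pretrained convnet) into the
    # word-embedding space
    def __init__(self, feat_dim=4096, embed_dim=200):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return self.proj(feats)

def nearest_label(image_vec, label_vectors):
    # classify an image as the word whose embedding is closest to it
    sims = {w: torch.cosine_similarity(image_vec, v, dim=0)
            for w, v in label_vectors.items()}
    return max(sims, key=sims.get)
```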
The interesting part is what happens when you test the model on new classes of images. For example, if the model wasn’t trained to classify cats – that is, to map them near the “cat” vector – what happens when we try to classify images of cats?
Socher-ImageClassManifold.png
It turns out that the network is able to handle these new classes of images quite reasonably. Images of cats aren’t mapped to random points in the word embedding space. Instead, they tend to be mapped to the general vicinity of the “dog” vector, and, in fact, close to the “cat” vector. Similarly, the truck images end up relatively close to the “truck” vector, which is near the related “automobile” vector.
Socher-ImageClass-tSNE.png
This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.
The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.
The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.
Even though I’ve never seen an Aesculapian snake or an armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.
(These results all exploit a sort of “these words are similar” reasoning. But it seems like much stronger results should be possible based on relationships between words. In our word embedding space, there is a consistent difference vector between male and female versions of words. Similarly, in image space, there are consistent features distinguishing between male and female. Beards, mustaches, and baldness are all strong, highly visible indicators of being male. Breasts and, less reliably, long hair, makeup and jewelry, are obvious indicators of being female.[6] Even if you’ve never seen a king before, if the queen, determined to be such by the presence of a crown, suddenly has a beard, it’s pretty reasonable to give the male version.)
Shared embeddings are an extremely exciting area of research and drive at why the representation focused perspective of deep learning is so compelling.
Recursive Neural Networks
We began our discussion of word embeddings with the following network:
Bottou-WordSetup.png
Modular Network that learns word embeddings (From Bottou (2011))
The above diagram represents a modular network, R(W(w1), W(w2), W(w3), W(w4), W(w5)). It is built from two modules, W and R. This approach, of building neural networks from smaller neural network “modules” that can be composed together, is not very widespread. It has, however, been very successful in NLP.
Models like the above are powerful, but they have an unfortunate limitation: they can only have a fixed number of inputs.
We can overcome this by adding an association module, A, which will take two word or phrase representations and merge them.
Bottou-Afold.png
(From Bottou (2011))
By merging sequences of words, A takes us from representing words to representing phrases or even representing whole sentences! And because we can merge together different numbers of words, we don’t have to have a fixed number of inputs.
It doesn’t necessarily make sense to merge together words in a sentence linearly. If one considers the phrase “the cat sat on the mat”, it can naturally be bracketed into segments: “((the cat) (sat (on (the mat))))”. We can apply A based on this bracketing:
Bottou-Atree.png
(From Bottou (2011))
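A minimal sketch of such an association module and its recursive application over a bracketing (the names, the Linear/tanh merge, and the sizes are my assumptions):

```python
import torch
import torch.nn as nn

class Assoc(nn.Module):
    # A: merge two representations into one of the same size, so it can be
    # applied recursively over a parse tree
    def __init__(self, dim):
        super().__init__()
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, left, right):
        return torch.tanh(self.merge(torch.cat([left, right], dim=-1)))

def encode(tree, A, W):
    # tree is either a word or a (left, right) pair, e.g.
    # (("the", "cat"), ("sat", ("on", ("the", "mat"))))
    if isinstance(tree, str):
        return W(tree)  # leaf: embedding lookup
    left, right = tree
    return A(encode(left, A, W), encode(right, A, W))
```

Because the output of A lives in the same space as its inputs, the same module can combine words into phrases and phrases into sentences.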
These models are often called “recursive neural networks” because one often has the output of a module go into a module of the same type. They are also sometimes called “tree-structured neural networks.”
Recursive neural networks have had significant successes in a number of NLP tasks. For example, Socher et al. (2013c) uses a recursive neural network to predict sentence sentiment:
Socher-SentimentTree.png
(From Socher et al. (2013c))
One major goal has been to create a reversible sentence representation, a representation that one can reconstruct an actual sentence from, with roughly the same meaning. For example, we can try to introduce a disassociation module, D, that tries to undo A:
Bottou-unfold.png
(From Bottou (2011))
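A sketch of what a disassociation module might look like (purely illustrative; making this work well is the hard research problem described below):

```python
import torch
import torch.nn as nn

class Disassoc(nn.Module):
    # D: try to undo A by splitting one representation back into two
    def __init__(self, dim):
        super().__init__()
        self.split = nn.Linear(dim, 2 * dim)

    def forward(self, merged):
        out = torch.tanh(self.split(merged))
        half = out.shape[-1] // 2
        return out[..., :half], out[..., half:]

# one would train so that D(A(x, y)) ≈ (x, y), e.g. with a reconstruction loss
```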
If we could accomplish such a thing, it would be an extremely powerful tool. For example, we could try to make a bilingual sentence representation and use it for translation.
Unfortunately, this turns out to be very difficult. Very very difficult. And given the tremendous promise, there are lots of people working on it.
Recently, Cho et al. (2014) have made some progress on representing phrases, with a model that can encode English phrases and decode them in French. Look at the phrase representations it learns!
Cho-TimePhrase-TSNE.png
Small section of the t-SNE of the phrase representation
(From Cho et al. (2014))
Criticisms
I’ve heard some of the results reviewed above criticized by researchers in other fields, in particular, in NLP and linguistics. The concerns are not with the results themselves, but the conclusions drawn from them, and how they compare to other techniques.
I don’t feel qualified to articulate these concerns. I’d encourage someone who feels this way to describe the concerns in the comments.
Conclusion
The representation perspective of deep learning is a powerful view that seems to answer why deep neural networks are so effective. Beyond that, I think there’s something extremely beautiful about it: why are neural networks effective? Because better ways of representing data can pop out of optimizing layered models.
Deep learning is a very young field, where theories aren’t strongly established and views quickly change. That said, it is my impression that the representation-focused perspective of neural networks is presently very popular.
This post reviews a lot of research results I find very exciting, but my main motivation is to set the stage for a future post exploring connections between deep learning, type theory and functional programming. If you’re interested, you can subscribe to my rss feed so that you’ll see it when it is published.
(I would be delighted to hear your comments and thoughts: you can comment inline or at the end. For typos, technical errors, or clarifications you would like to see added, you are encouraged to make a pull request on github)
Acknowledgments
I’m grateful to Eliana Lorch, Yoshua Bengio, Michael Nielsen, Laura Ball, Rob Gilson, and Jacob Steinhardt for their comments and support.
Footnotes
[1] Constructing a case for every possible input requires 2^n hidden neurons, when you have n input neurons. In reality, the situation isn’t usually that bad. You can have cases that encompass multiple inputs. And you can have overlapping cases that add together to achieve the right input on their intersection.
[2] It isn’t only perceptron networks that have universality. Networks of sigmoid neurons (and other activation functions) are also universal: given enough hidden neurons, they can approximate any continuous function arbitrarily well. Seeing this is significantly trickier because you can’t just isolate inputs.
[3] Word embeddings were originally developed in (Bengio et al, 2001; Bengio et al, 2003), a few years before the 2006 deep learning renewal, at a time when neural networks were out of fashion. The idea of distributed representations for symbols is even older, e.g. (Hinton 1986).
[4] The seminal paper, A Neural Probabilistic Language Model (Bengio, et al. 2003), has a great deal of insight about why word embeddings are powerful.
[5] Previous work has been done modeling the joint distributions of tags and images, but it took a very different perspective.
[6] I’m very conscious that physical indicators of gender can be misleading. I don’t mean to imply, for example, that everyone who is bald is male or everyone who has breasts is female. Just that these often indicate such, and greatly adjust our prior.