The bag-of-words model

(Source: http://en.wikipedia.org/wiki/Bag_of_words_model)

The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.
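A minimal sketch of this representation in Python follows. The function name `bag_of_words`, the tokenization rule, and the sample sentences are assumptions made for illustration; they do not come from the source. The sketch shows that two texts with the same words in a different order map to the same bag.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping; grammar and word order are discarded."""
    # Assumed tokenization: lowercase, then keep runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Two sentences with the same words in a different order (invented examples).
a = bag_of_words("John likes movies and Mary likes movies too")
b = bag_of_words("Mary likes movies too and John likes movies")

print(a == b)        # the two bags are identical
print(a["likes"])    # how many times "likes" occurs
```

A `Counter` keeps how often each word occurs, which is the multiset reading of "bag"; a plain `set` would also discard the counts.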

The bag-of-words model is used in some methods of document classification. When a Naive Bayes classifier is applied to text, for example, the conditional independence assumption leads to the bag-of-words model.[1] Other text-analysis methods that use this model include latent Dirichlet allocation and latent semantic analysis.[2]
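The link between conditional independence and the bag-of-words model can be written out. The notation here is introduced for illustration and is not taken from the source: c is a class, w_1, ..., w_n are the words of a document, V is the vocabulary, and n_w is the number of times word w occurs in the document. The same word distribution is assumed at every position, as is usual for Naive Bayes on text.

```latex
P(w_1, \dots, w_n \mid c) \;=\; \prod_{i=1}^{n} P(w_i \mid c) \;=\; \prod_{w \in V} P(w \mid c)^{\,n_w}
```

The product does not change when the words are reordered, so the classifier only sees which words occur and how often. That is the bag-of-words representation.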

Example: Spam filtering
In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham"). Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" much more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words poured out at random from one of the two bags, and uses Bayesian probability to determine which bag it more likely came from.
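The two-bag picture can be sketched as a small Naive Bayes classifier. This is one possible reading, not a definitive implementation. The function names (`train`, `log_likelihood`, `classify`), the add-one smoothing, the equal prior, and the toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Pour the words of several tokenized messages into one bag of counts."""
    bag = Counter()
    for words in messages:
        bag.update(words)
    return bag

def log_likelihood(words, bag, vocab_size):
    """Log-probability of drawing `words` from `bag` (add-one smoothing, assumed)."""
    total = sum(bag.values())
    return sum(math.log((bag[w] + 1) / (total + vocab_size)) for w in words)

def classify(words, spam_bag, ham_bag, prior_spam=0.5):
    """Return the label of the bag that more likely produced `words`."""
    vocab_size = len(set(spam_bag) | set(ham_bag))
    spam_score = math.log(prior_spam) + log_likelihood(words, spam_bag, vocab_size)
    ham_score = math.log(1 - prior_spam) + log_likelihood(words, ham_bag, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Toy training data, invented for illustration only.
spam_bag = train([["buy", "stock", "now"], ["buy", "viagra"]])
ham_bag = train([["lunch", "with", "friends"], ["meeting", "at", "work"]])

print(classify(["buy", "stock"], spam_bag, ham_bag))
print(classify(["lunch", "at", "work"], spam_bag, ham_bag))
```

Scores are summed in log space to avoid numerical underflow on long messages. Because the score is a sum over words, shuffling the message does not change the result, which matches the "pile of words" assumption.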

Original article: https://www.cnblogs.com/kevinGaoblog/p/2497938.html