Preprocessing and Normalization for Spam Email Classification

The main steps are:

• Lower-casing: The entire email is converted to lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
• Stripping HTML: Many emails come with HTML formatting; all HTML tags are removed so that only the content remains.
• Normalizing URLs: All URLs are replaced with the text "httpaddr".
• Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".
• Normalizing Numbers: All numbers are replaced with the text "number".
• Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".
• Word Stemming: Words are reduced to their stemmed form. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount". Sometimes the stemmer strips additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
• Removal of non-words: Non-words and punctuation are removed. All runs of white space (tabs, newlines, spaces) are collapsed to a single space character.
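The last two bullets (stemming and removal of non-words) can be sketched in Python as follows. This is only an illustration: `toy_stem`, `tokenize`, and the suffix list are hypothetical names made up here, and the suffix stripping is a crude stand-in for a real stemmer such as the Porter stemmer, which handles far more cases.

```python
import re
from typing import List

# Hypothetical suffix list, for illustration only. A real pipeline would
# use an established stemming algorithm instead of this shortcut.
_SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> List[str]:
    """Lower-case, replace non-alphanumeric runs with a space, then stem."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # str.split() with no argument also collapses repeated whitespace.
    return [toy_stem(token) for token in cleaned.split()]


print(tokenize("Discounts, discounted & discounting!"))
# prints ['discount', 'discount', 'discount']
```

With this toy rule set, "include", "includes", "included" and "including" all reduce to "includ", matching the example in the list, but the rules will over-strip or under-strip many other words.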

The MATLAB code for the substitution steps is quite concise:
    % Lower case
    email_contents = lower(email_contents);

    % Strip all HTML
    % Looks for any expression that starts with < and ends with >,
    % has no < or > inside the tag, and replaces it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

    % Handle Numbers
    % Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');

    % Handle URLs
    % Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                               '(http|https)://[^\s]*', 'httpaddr');

    % Handle Email Addresses
    % Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

    % Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');
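For readers without MATLAB, the same substitutions can be tried in Python. The sketch below is a port written for this post, not part of the course material; the function name `normalize_email` and the sample string are assumptions chosen for illustration. It keeps the same order as the MATLAB code, which means digits inside a URL are turned into "number" before the URL itself is replaced.

```python
import re


def normalize_email(text: str) -> str:
    """Apply the MATLAB substitutions above, in the same order."""
    text = text.lower()                                         # lower case
    text = re.sub(r"<[^<>]+>", " ", text)                       # strip HTML tags
    text = re.sub(r"[0-9]+", "number", text)                    # numbers
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)   # URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)          # email addresses
    text = re.sub(r"[$]+", "dollar", text)                      # dollar signs
    return text


# Hypothetical sample input.
sample = "<b>Visit</b> http://example.com/a1 or mail Bob@Example.com, only $10!"
print(normalize_email(sample))
```

Note that the email pattern matches any run of non-whitespace around the `@`, so punctuation attached to an address (such as the comma in the sample) is swallowed into "emailaddr".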
Source: Machine Learning, Andrew Ng, https://www.coursera.org/learn/machine-learning/programming/e4hZk/support-vector-machines

Original post: https://www.cnblogs.com/gui0901/p/5241753.html