  • Preprocessing and Normalization for Spam Email Classification

    The main steps are:

    • Lower-casing: The entire email is converted to lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
    • Stripping HTML: Many emails come with HTML formatting; all HTML tags are removed so that only the content remains.
    • Normalizing URLs: All URLs are replaced with the text "httpaddr".
    • Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".
    • Normalizing Numbers: All numbers are replaced with the text "number".
    • Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".
    • Word Stemming: Words are reduced to their stemmed form. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount". Sometimes the stemmer strips additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
    • Removal of non-words: Non-words and punctuation are removed. All white space (tabs, newlines, spaces) is collapsed to a single space character.

    The corresponding MATLAB code is quite concise!

    % Lower case
    email_contents = lower(email_contents);
    
    % Strip all HTML
    % Looks for any expression that starts with < and ends with > and
    % does not have any < or > in the tag, and replaces it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
    
    % Handle Numbers
    % Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');
    
    % Handle URLS
    % Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                               '(http|https)://[^\s]*', 'httpaddr');
    
    % Handle Email Addresses
    % Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
    
    % Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');
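
    The code above stops at the dollar sign, so the last item in the list (removing non-words and collapsing white space) is not shown. A minimal sketch of that step follows, using only regexprep and strtrim. The sample string and the choice to replace every non-alphanumeric character with a space are assumptions made for illustration, not the course's own implementation. Stemming is left out because it needs a separate stemmer function.

```matlab
% Hypothetical sample input, as it might look after the substitutions above
email_contents = sprintf('visit httpaddr now!!\n\tsend dollarnumber to emailaddr.');

% Remove non-words: replace every character that is not a letter or a digit
% with a space (an assumption made for this sketch)
email_contents = regexprep(email_contents, '[^a-zA-Z0-9]', ' ');

% Collapse runs of white space (spaces, tabs, newlines) into a single space
email_contents = regexprep(email_contents, '\s+', ' ');

% Drop leading and trailing white space
email_contents = strtrim(email_contents);

disp(email_contents);
% prints: visit httpaddr now send dollarnumber to emailaddr
```

    Replacing punctuation with a space instead of deleting it keeps neighbouring words from being fused together; the second regexprep then cleans up the extra spaces this leaves behind.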

    Source: Machine Learning, Andrew Ng, https://www.coursera.org/learn/machine-learning/programming/e4hZk/support-vector-machines

  • Original article: https://www.cnblogs.com/gui0901/p/5241753.html