  • Python Language Summary 4.2. Functions for processing strings (str, unicode, etc.)

    4.2.7. Removing control characters: removeCtlChr


    Removes ASCII control characters so that the processed string is legal in XML.

    #------------------------------------------------------------------------------
    # Remove control characters from the input string.
    # Otherwise the WordPress importer fails: if the WXR file contains
    # control characters, the import is aborted.
    # e.g.:
    # 1. http://againinput4.blog.163.com/blog/static/172799491201110111145259/
    #    content contains some invalid ASCII control chars
    # 2. http://hi.baidu.com/notebookrelated/blog/item/8bd88e351d449789a71e12c2.html
    #    165th comment contains an invalid control char: ETX
    # 3. http://green-waste.blog.163.com/blog/static/32677678200879111913911/
    #    title contains control chars: DC1, BS, DLE, DLE, DLE, DC1
    def removeCtlChr(inputString):
        # control characters that are kept
        validChrList = [
            9,   # TAB
            10,  # LF = Line Feed
            13,  # CR = Carriage Return
        ]
        validChars = []
        for c in inputString:
            asciiVal = ord(c)
            # filter out the other ASCII control characters, and DEL (0x7F)
            isValidChr = True
            if asciiVal == 0x7F:
                isValidChr = False
            elif (asciiVal < 32) and (asciiVal not in validChrList):
                isValidChr = False

            if isValidChr:
                validChars.append(c)

        return ''.join(validChars)
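
The same filtering can also be written with a regular expression. The following is only a minimal sketch for Python 3, not the original author's implementation; the name removeCtlChrRegex and the pattern constant are hypothetical and introduced here for illustration. It keeps TAB (9), LF (10) and CR (13) and drops every other ASCII control character plus DEL (0x7F), which is the rule removeCtlChr applies.

```python
import re

# Assumption: same rule as removeCtlChr above.
# Matches 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F and 0x7F,
# i.e. every ASCII control character except TAB, LF and CR, plus DEL.
CTL_CHR_PATTERN = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')


def removeCtlChrRegex(inputString):
    """Return inputString with the matched control characters removed."""
    return CTL_CHR_PATTERN.sub('', inputString)


if __name__ == '__main__':
    # made-up sample containing DC1, BS, DLE and DEL
    sample = 'a\x11\x08\x10b\tc\n\x7f'
    print(repr(removeCtlChrRegex(sample)))
```

Compiling the pattern once avoids rebuilding it on every call, which may matter when many titles and comments are cleaned in a loop.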
           

    Example 4.11. Usage example for removeCtlChr

    # remove the control chars in the title, e.g.:
    # http://green-waste.blog.163.com/blog/static/32677678200879111913911/
    # title contains control chars: DC1, BS, DLE, DLE, DLE, DC1
    infoDict['title'] = removeCtlChr(infoDict['title'])
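
To see why the cleanup matters for XML, the sketch below wraps a title in an element and parses it before and after cleaning. It is an illustration under stated assumptions: the title text is made up, strip_ctl and is_well_formed are hypothetical helpers (strip_ctl is a compact stand-in that follows the same rule as removeCtlChr), and the parser is the one in Python 3's standard library, so other XML consumers such as the WordPress importer may react differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for removeCtlChr: delete ASCII control characters
# except TAB (9), LF (10), CR (13), and also delete DEL (0x7F).
_DELETE_TABLE = {cp: None for cp in list(range(32)) + [0x7F]
                 if cp not in (9, 10, 13)}


def strip_ctl(text):
    # str.translate removes every code point mapped to None
    return text.translate(_DELETE_TABLE)


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# made-up title containing DC1, BS and DLE
title = 'green\x11\x08\x10 waste'
raw_xml = '<title>%s</title>' % title
clean_xml = '<title>%s</title>' % strip_ctl(title)

print(is_well_formed(raw_xml))
print(is_well_formed(clean_xml))
```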
               

    [Tip]     About control characters

    If you are not familiar with control characters, see: Function/control characters in the ASCII character set
  • Original article: https://www.cnblogs.com/lexus/p/3323632.html