  • URL encoding and decoding

    Code

    from urllib.parse import quote, unquote, urlencode

    print(quote('https://www.cnblogs.com/?a=bc&d=f'))
    # https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df

    print(urlencode({'a': 'b', 'b': 'c'}))
    # a=b&b=c


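    print(unquote('https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'))
    print(unquote('a=b&b=c'))

    # Encoding
    # quote operates on a string and encodes the URL's parameters and special characters.
    # urlencode operates on a dict, or on a list of tuples.
    # Decoding
    # There is only unquote; there is no urldecode,
    # so decoding is done with unquote.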
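
The notes above mention that urlencode also takes a list of tuples, and that unquote is the only decoder. A minimal sketch of both points follows. The sample pairs are made up for illustration, and parse_qsl / parse_qs are not used in the original post: they are standard-library helpers from the same urllib.parse module, shown here only as one possible way to get the key/value pairs back.

```python
from urllib.parse import parse_qs, parse_qsl, unquote, urlencode

# urlencode accepts a dict or a sequence of two-element tuples.
pairs = [('a', 'b'), ('b', 'c d')]      # illustrative data, not from the post
query = urlencode(pairs)                 # the space becomes '+'

# unquote only reverses %XX escapes: it does not split the query
# and it leaves '+' untouched.
raw = unquote(query)

# parse_qsl / parse_qs split on '&' and '=' and decode each key and value.
decoded_pairs = parse_qsl(query)
decoded_dict = parse_qs(query)

print(query)          # a=b&b=c+d
print(raw)            # a=b&b=c+d
print(decoded_pairs)  # [('a', 'b'), ('b', 'c d')]
print(decoded_dict)   # {'a': ['b'], 'b': ['c d']}
```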

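     As the output above shows, quote also encodes the colon that follows the scheme (https:); of the URL delimiters, only '/' is left unencoded. This can cause a crawler a lot of trouble, so let's look at its source code: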

    def quote(string, safe='/', encoding=None, errors=None):
        """quote('abc def') -> 'abc%20def'
    
        Each part of a URL, e.g. the path info, the query, etc., has a
        different set of reserved characters that must be quoted. The
        quote function offers a cautious (not minimal) way to quote a
        string for most of these parts.
    
        RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
        the following (un)reserved characters.
    
        unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
        reserved      = gen-delims / sub-delims
        gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
        sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                      / "*" / "+" / "," / ";" / "="
    
        Each of the reserved characters is reserved in some component of a URL,
        but not necessarily in all of them.
    
        The quote function %-escapes all characters that are neither in the
        unreserved chars ("always safe") nor the additional chars set via the
        safe arg.
    
        The default for the safe arg is '/'. The character is reserved, but in
        typical usage the quote function is being called on a path where the
        existing slash characters are to be preserved.
    
        Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
        Now, "~" is included in the set of unreserved characters.
    
        string and safe may be either str or bytes objects. encoding and errors
        must not be specified if string is a bytes object.
    
        The optional encoding and errors parameters specify how to deal with
        non-ASCII characters, as accepted by the str.encode method.
        By default, encoding='utf-8' (characters are encoded with UTF-8), and
        errors='strict' (unsupported characters raise a UnicodeEncodeError).
        """
        if isinstance(string, str):
            if not string:
                return string
            if encoding is None:
                encoding = 'utf-8'
            if errors is None:
                errors = 'strict'
            string = string.encode(encoding, errors)
        else:
            if encoding is not None:
                raise TypeError("quote() doesn't support 'encoding' for bytes")
            if errors is not None:
                raise TypeError("quote() doesn't support 'errors' for bytes")
        return quote_from_bytes(string, safe)
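
The str branch of the source above encodes the text (UTF-8 unless told otherwise) and hands the bytes to quote_from_bytes, while the bytes branch rejects encoding and errors. A small sketch of that behavior, using a sample string chosen here for illustration:

```python
from urllib.parse import quote, quote_from_bytes

text = '中文'  # illustrative non-ASCII input, not from the post

# str input: encoded with UTF-8 by default, then quoted byte by byte.
utf8_quoted = quote(text)
via_bytes = quote_from_bytes(text.encode('utf-8'))

# A different codec yields different bytes, hence different escapes.
gbk_quoted = quote(text, encoding='gbk')

# bytes input: passing encoding (or errors) raises TypeError.
try:
    quote(text.encode('utf-8'), encoding='utf-8')
    bytes_with_encoding_rejected = False
except TypeError:
    bytes_with_encoding_rejected = True

print(utf8_quoted)                   # %E4%B8%AD%E6%96%87
print(via_bytes == utf8_quoted)      # True
print(gbk_quoted)                    # %D6%D0%CE%C4
print(bytes_with_encoding_rejected)  # True
```

This matters when a target site expects a non-UTF-8 encoding in its query strings: the same text produces different escapes depending on the encoding argument.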

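      In other words, apart from letters and digits, only the following four symbols are never encoded by default (along with '/', which is the default value of safe); everything else is encoded:

    '_.-~'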
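
A quick way to check this claim is to quote every ASCII punctuation character with an empty safe set and see which ones come back unchanged. This is only a sketch and assumes Python 3.7 or later, where '~' counts as unreserved according to the docstring above.

```python
import string
from urllib.parse import quote

# Letters, digits and the four symbols pass through even with safe=''.
always_safe = string.ascii_letters + string.digits + '_.-~'
unchanged = quote(always_safe, safe='')

# Which punctuation characters survive when nothing extra is marked safe?
survivors = ''.join(c for c in string.punctuation if quote(c, safe='') == c)

# '/' survives only because it is the default value of safe.
slash_default = quote('/')
slash_no_safe = quote('/', safe='')

print(unchanged == always_safe)  # True
print(survivors)                 # -._~
print(slash_default)             # /
print(slash_no_safe)             # %2F
```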

      In addition, any characters passed in through the safe parameter are not encoded either. The effect looks like this:

    quote('https://www.cnblogs.com/?a=bc&d=f')
    'https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'

    quote('https://www.cnblogs.com/?a=bc&d=f', safe=':/')
    'https://www.cnblogs.com/%3Fa%3Dbc%26d%3Df'

    quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?/')
    'https://www.cnblogs.com/?a%3Dbc%26d%3Df'

    quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?=/')
    'https://www.cnblogs.com/?a=bc%26d=f'

      In the source, the default safe is just '/'. If you pass your own safe and still need '/' left unencoded, remember to include '/' in it as well.
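
The pitfall just described can be sketched in a few lines, followed by one possible way around tuning safe by hand for a whole URL: split the URL and encode each component on its own. The encode_url helper and the second sample URL are hypothetical, introduced here for illustration only; urlsplit, urlunsplit and parse_qsl are standard urllib.parse functions that the original post does not use.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

url = 'https://www.cnblogs.com/?a=bc&d=f'

# A custom safe replaces the default, so '/' is no longer protected.
without_slash = quote(url, safe=':')
with_slash = quote(url, safe=':/')


def encode_url(raw_url):
    """Hypothetical helper: percent-encode a URL one component at a time."""
    parts = urlsplit(raw_url)
    # Path: keep '/' as the segment separator, escape everything else unsafe.
    path = quote(parts.path, safe='/')
    # Query: decode into pairs, then let urlencode escape keys and values.
    query = urlencode(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


encoded = encode_url('https://www.cnblogs.com/some path/?a=b c&d=f')

print(without_slash)  # https:%2F%2Fwww.cnblogs.com%2F%3Fa%3Dbc%26d%3Df
print(with_slash)     # https://www.cnblogs.com/%3Fa%3Dbc%26d%3Df
print(encoded)        # https://www.cnblogs.com/some%20path/?a=b+c&d=f
```

One caveat with this sketch: a path that already contains percent-escapes would be escaped a second time, because quote also encodes '%'. It fits raw, unescaped input best.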

  • Original article: https://www.cnblogs.com/tjp40922/p/12617020.html