  • Python Basics: Using urlparse for Web Scraping, with a Simple Example

    >>> from urllib.parse import urlparse
    >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
    >>> o   
    ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                params='', query='', fragment='')
    >>> o.scheme
    'http'
    >>> o.port
    80
    >>> o.geturl()
    'http://www.cwi.nl:80/%7Eguido/Python.html'
    >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
    ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                params='', query='', fragment='')
    >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
    ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                params='', query='', fragment='')
    >>> urlparse('help/Python.html')
    ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                query='', fragment='')
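The examples above show that urlparse only recognizes a network location when it is introduced by `//`; otherwise the whole string ends up in `path`. A minimal sketch of one way to handle scheme-less input such as `www.cwi.nl/...` follows. The helper name `parse_with_default_scheme` and the choice of `http` as the fallback scheme are assumptions made for illustration, not part of urllib. The helper cannot tell a bare host from a relative path, so it would also treat `help/Python.html` as host `help`.

```python
from urllib.parse import urlparse


def parse_with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: parse url, treating 'host/path' input as a network location."""
    parsed = urlparse(url, scheme=default_scheme)
    if not parsed.netloc and not url.startswith("/"):
        # No '//' was present, so urlparse put the host into path.
        # Prepending '//' makes it recognize the network location.
        parsed = urlparse("//" + url, scheme=default_scheme)
    return parsed


result = parse_with_default_scheme("www.cwi.nl/%7Eguido/Python.html")
print(result.scheme, result.netloc, result.path)
# http www.cwi.nl /%7Eguido/Python.html
```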
    

      

    Attribute   Index   Value                                 Value if not present
    scheme      0       URL scheme specifier                  scheme parameter
    netloc      1       Network location part                 empty string
    path        2       Hierarchical path                     empty string
    params      3       Parameters for last path element      empty string
    query       4       Query component                       empty string
    fragment    5       Fragment identifier                   empty string
    username            User name                             None
    password            Password                              None
    hostname            Host name (lower case)                None
    port                Port number as integer, if present    None
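To make the attribute table concrete, here is a short sketch that reads each attribute from a single parse result. The URL, including its user name and password, is made up for illustration so that every field is populated.

```python
from urllib.parse import urlparse

# Hypothetical URL with made-up credentials, chosen so that every field is populated.
parts = urlparse("http://user:secret@WWW.Example.com:8080/path;type=a?q=1#top")

# Fields 0-5 are available both by name and by tuple index.
scheme, netloc, path, params, query, fragment = parts
print(parts[1] == parts.netloc)  # True

# The remaining attributes are derived from netloc and have no index.
print(parts.username, parts.password, parts.hostname, parts.port)
# user secret www.example.com 8080

# When a component is absent, the "Value if not present" column applies.
bare = urlparse("http://www.cwi.nl/%7Eguido/Python.html")
print(repr(bare.query), bare.username, bare.port)
# '' None None
```

Note that `hostname` is lower-cased while `netloc` keeps the original spelling.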

    >>> from urllib.parse import urljoin
    >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
    'http://www.cwi.nl/%7Eguido/FAQ.html'
    >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
    ...         '//www.python.org/%7Eguido')
    'http://www.python.org/%7Eguido'
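In a crawler, urljoin is typically used to turn the links found on a page into absolute URLs. The sketch below resolves a few kinds of link against the base URL from the example above. The `hrefs` list is hypothetical and stands in for values that would be extracted from a page.

```python
from urllib.parse import urljoin

base = "http://www.cwi.nl/%7Eguido/Python.html"

# Hypothetical href values, as they might appear in the page at `base`.
hrefs = [
    "FAQ.html",                 # relative to the current directory
    "/FAQ.html",                # absolute path on the same host
    "../FAQ.html",              # one directory up
    "https://www.python.org/",  # already absolute: returned unchanged
]

absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
# http://www.cwi.nl/%7Eguido/FAQ.html
# http://www.cwi.nl/FAQ.html
# http://www.cwi.nl/FAQ.html
# https://www.python.org/
```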
    

      

    >>> from urllib.parse import quote, unquote
    >>> quote('http://www.baidu.com')
    'http%3A//www.baidu.com'
    >>> unquote('http%3A//www.baidu.com')
    'http://www.baidu.com'
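By default quote leaves `/` unescaped, which is why only the colon changed in the example above. The sketch below shows the `safe` parameter, a round trip with non-ASCII text, and `urlencode` for building a query string. The parameter names `q` and `page` are made up for illustration and do not refer to any particular site's API.

```python
from urllib.parse import quote, unquote, urlencode, parse_qs

# safe='' also escapes '/', which is useful when a whole URL is passed as a value.
print(quote("http://www.baidu.com", safe=""))
# http%3A%2F%2Fwww.baidu.com

# Non-ASCII text is encoded as UTF-8 and percent-escaped; unquote reverses it.
keyword = "python 爬虫"
escaped = quote(keyword)
print(unquote(escaped) == keyword)  # True

# urlencode builds a query string from a mapping (hypothetical parameter names).
query = urlencode({"q": "python urlparse", "page": 2})
print(query)
# q=python+urlparse&page=2

# parse_qs goes the other way; every value comes back as a list of strings.
print(parse_qs(query))
# {'q': ['python urlparse'], 'page': ['2']}
```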
    

      

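    A simple demo

    The approach is as follows:

    1. Fetch a web page and read the response body into a variable.
    2. Open a local file for writing, giving it a web-page extension such as *.html.
    3. Write the variable from step 1 to the file.
    4. Close the file.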
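
    import urllib.request

    url = 'http://www.baidu.com'
    # Send a browser-like User-Agent; some sites may reject urllib's default one.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }

    # Step 1: fetch the page and read the raw response bytes.
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request).read()

    # Steps 2-4: open a local file in binary write mode and write the bytes;
    # the with block closes the file automatically.
    with open("./1.html", "wb") as h:
        h.write(response)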
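The demo saves every page under the fixed name `1.html`. As a sketch of how urlparse could be combined with it, the helper below derives a local file name from the URL instead. The function name `local_filename` and the `index.html` fallback are assumptions for illustration; a real crawler would also need to deal with name collisions and characters that are invalid in file names.

```python
import posixpath
from urllib.parse import urlparse, unquote


def local_filename(url, default="index.html"):
    """Hypothetical helper: pick a local file name from the last path segment of url."""
    # Decode percent-escapes first, then keep only the final path segment.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    # Fall back to the default for empty paths and for '.' or '..' segments.
    if name in ("", ".", ".."):
        return default
    return name


print(local_filename("http://www.baidu.com"))
# index.html
print(local_filename("http://www.cwi.nl:80/%7Eguido/Python.html"))
# Python.html
```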
    

     

    References:

    https://docs.python.org/3/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse

    https://blog.csdn.net/fengxinlinux/article/details/77281253

    https://www.runoob.com/python/python-func-open.html

  • Original article: https://www.cnblogs.com/guanbin-529/p/12833766.html