  • Fetching a web page's HTML with urllib.request

    Syntax: urllib.request.urlopen

    As the name suggests, it opens a URL.

    # Import urllib
    import urllib.request
    
    # Open the URL (no request body, 10-second timeout)
    response = urllib.request.urlopen('https://movie.douban.com/', None, 10)
    # Read the response body and decode it as UTF-8
    html = response.read().decode('utf-8')
    # Write it to a text file
    with open('html','w',encoding='utf-8') as f:
        f.write(html)

    In other words: open a web page, read the response, decode it, and save the result to a text file.

    However, running it raised an error: urllib.error.HTTPError: HTTP Error 418: 
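    To inspect this kind of failure instead of letting the script crash, the exception can be caught and its status code read. The following is a minimal sketch, not the author's code: try_fetch and fake_opener are hypothetical names, and fake_opener only simulates a 418 response so the example runs without network access.

```python
import io
import urllib.error
import urllib.request


def try_fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return (status_code, text); text is None if the server answers with an HTTP error."""
    try:
        with opener(url, None, timeout) as response:
            return response.status, response.read().decode('utf-8')
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and reason sent by the server
        print(f'HTTP Error {err.code}: {err.reason}')
        return err.code, None


def fake_opener(url, data=None, timeout=None):
    # Hypothetical stand-in for urlopen that always answers 418 (offline demo only)
    raise urllib.error.HTTPError(url, 418, '', {}, io.BytesIO(b''))


status, text = try_fetch('https://movie.douban.com/', opener=fake_opener)
```

    Calling try_fetch(url) without the opener argument would use the real urllib.request.urlopen.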

    Solution:

    Send request headers along with the request for the URL.
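    The post does not say why the server refused the request. One plausible explanation, stated here as an assumption, is that urllib announces itself with its own User-Agent rather than a browser's. The sketch below only inspects the default headers urllib attaches and makes no network request.

```python
import urllib.request

# OpenerDirector.addheaders lists the headers urllib adds to every request by default
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The User-agent value starts with "Python-urllib/" followed by the Python version
print(default_headers)
```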

    Capture the browser's traffic with Fiddler, find the request headers of the captured request, and copy its User-Agent:

    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36

    The code is as follows:

    # Import urllib
    import urllib.request
    # Define the request headers
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'}
    # Build a Request for the URL that carries the headers
    _url = urllib.request.Request('https://movie.douban.com/',headers=headers)
    # Open the request (no request body, 10-second timeout)
    response = urllib.request.urlopen(_url, None, 10)
    # Read the response body and decode it as UTF-8
    html = response.read().decode('utf-8')
    # Write it to a text file
    with open('html','w',encoding='utf-8') as f:
        f.write(html)
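    If several pages need the same header, the steps above could be wrapped in small helpers. This is only a sketch under assumptions: build_request and fetch_html are hypothetical names that do not appear in the original post, and only build_request is exercised here, so the example runs offline.

```python
import urllib.request

# Same User-Agent string as in the post
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36')


def build_request(url, user_agent=USER_AGENT):
    """Hypothetical helper: wrap a URL in a Request carrying a browser User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Hypothetical helper: fetch url and decode the body as UTF-8 (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode('utf-8')


req = build_request('https://movie.douban.com/')
# Request normalizes header names with str.capitalize(), hence 'User-agent'
print(req.get_header('User-agent'))
```

    With network access, html = fetch_html('https://movie.douban.com/') would then perform the actual request.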
  • Original post: https://www.cnblogs.com/lijunlin-py/p/14916351.html