A Simple Python Crawler
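I have only just started learning about web crawlers, so I found some code online:

    import urllib.request

    def getHtml(url):
        # Fetch the page at the given URL and return the raw response bytes
        page = urllib.request.urlopen(url)
        html = page.read()
        return html

    html = getHtml("http://www.baidu.com")

    print(html)

I ran it, and it failed right away!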

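First error. The error message was: urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>

Cause: Lantern (蓝灯, a proxy tool) was installed on this computer, and it had also left the Edge browser unable to load pages. Most likely the system proxy settings still pointed at Lantern's local port; urllib picks up the system proxy settings by default, so the script's request was refused in the same way.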
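To check whether a proxy setting is involved, one option is to look at what urllib detects and, if needed, build an opener that bypasses it. The sketch below is only an illustration under that assumption; the helper names `detected_proxies` and `build_direct_opener` are made up for this example and are not from the original post.

```python
import urllib.request


def detected_proxies():
    """Return the proxy settings urllib picks up from the environment or system."""
    return urllib.request.getproxies()


def build_direct_opener():
    """Build an opener that ignores any auto-detected proxy.

    Passing an empty dict to ProxyHandler disables proxy auto-detection.
    """
    no_proxy = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(no_proxy)


# Show which proxies (if any) urllib would use by default.
print("Detected proxies:", detected_proxies())

# Example use (performs a network request, so it is left commented out):
# opener = build_direct_opener()
# html = opener.open("http://www.baidu.com", timeout=10).read()
```

If the printed dictionary contains an address such as a local port used by a proxy tool that is no longer running, that would be consistent with the "connection refused" error above.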

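Second error: the output was garbled bytes instead of readable HTML:

    b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03\x9d|\xfds\xe3Hv\xd8\xef\xf9+(\xcc\xae\x06X\x81\x10I\x8d\xbe\x08\xb5\xa6\xc4/\x8d4\xd2H#Q\xf3\xa5QX... (truncated)

Cause: the response body was gzip-compressed. Many websites gzip their pages for clients that support it, and urllib does not decompress the body automatically. The leading bytes \x1f\x8b are the gzip signature. Because the content is compressed, it cannot be decoded directly; it has to be decompressed with Python's gzip module first and then decoded.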
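The point about compressed bytes can be reproduced offline. The following is a minimal sketch, not the original author's code: `looks_gzipped` and `sample_html` are hypothetical names, and the sample page is invented for the demonstration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Return True if the bytes start with the gzip signature."""
    return data[:2] == GZIP_MAGIC


# Simulate what a server sends when it gzips a page.
sample_html = "<html><body>你好</body></html>"
compressed = gzip.compress(sample_html.encode("utf-8"))

print(looks_gzipped(compressed))  # True

# Decoding the compressed bytes directly fails, because they are not UTF-8 text.
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    print("Cannot decode compressed bytes directly:", exc)

# Decompress first, then decode.
restored = gzip.decompress(compressed).decode("utf-8")
print(restored == sample_html)  # True
```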

Revised code:

    import urllib.request
    import gzip

    def getHtml(url):
        # Fetch the page at the given URL and return the raw response bytes
        page = urllib.request.urlopen(url)
        html = page.read()
        return html

    html = getHtml("http://www.baidu.com")
    html = gzip.decompress(html)   # the body is gzip-compressed, so decompress first
    html = html.decode('utf-8')    # then decode the bytes to text

    print(html)

Now the page's HTML comes back correctly!
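One caveat: calling `gzip.decompress` unconditionally raises an error if a server returns an uncompressed page. A slightly more defensive variant might look like the sketch below. It assumes the server reports compression through the `Content-Encoding` header or that the body starts with the gzip signature; `decode_body` and `get_html` are illustrative names, not part of the original code.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding=None, charset=None):
    """Decompress the body if it is gzipped, then decode it to text."""
    is_gzip = (content_encoding or "").lower() == "gzip" or raw[:2] == b"\x1f\x8b"
    if is_gzip:
        raw = gzip.decompress(raw)
    # Fall back to UTF-8 when the server does not declare a charset.
    return raw.decode(charset or "utf-8", errors="replace")


def get_html(url, timeout=10):
    """Fetch url and return its HTML as text (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as page:
        raw = page.read()
        encoding = page.headers.get("Content-Encoding")
        charset = page.headers.get_content_charset()
    return decode_body(raw, encoding, charset)
```

With this in place, `print(get_html("http://www.baidu.com"))` should work whether or not the response happens to be compressed.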
Original post: https://www.cnblogs.com/gousheng/p/7646584.html
