
Web Crawling with Python (Part 1)

1. Fetch a page and print its contents

import requests

# GET request
response_get = requests.get("https://www.baidu.com")  # returns a Response object
response_get.encoding = response_get.apparent_encoding  # use the encoding detected from the body

# POST request
response_post = requests.post("http://httpbin.org/post")
response_post.encoding = response_post.apparent_encoding

print("HTML of the Baidu home page (GET request):")
print(response_get.text)
print("Response body from httpbin.org (POST request):")
print(response_post.text)
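
The encoding assignment matters because response.text is just the raw response bytes decoded with response.encoding. When a server sends a text/* Content-Type without a charset, requests falls back to ISO-8859-1, which garbles Chinese pages. The sketch below reproduces that effect with plain bytes and no network access; the sample string is an illustrative assumption, not data from a real response.

```python
# Illustrative sample only: stands in for the bytes of a UTF-8 encoded page.
raw = "<title>百度一下</title>".encode("utf-8")

# Decoding with the wrong charset (the ISO-8859-1 fallback) yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the charset the page really uses recovers the text.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```

Setting response.encoding to response.apparent_encoding asks requests to guess the charset from the body itself, which is usually enough to avoid the garbled case.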

2. Handling pages with anti-crawling measures

import requests

# Bypassing a basic anti-crawling check
url = "https://www.zhihu.com"

# Without custom headers the site may reject the request
response_get = requests.get(url)  # returns a Response object
response_get.encoding = response_get.apparent_encoding  # use the detected encoding
print("Without headers, status code:", response_get.status_code)
print("Fetched HTML (GET request):")
print(response_get.text)

# Set a User-Agent header so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
}
response_get = requests.get(url, headers=headers)
response_get.encoding = response_get.apparent_encoding
print("With headers, status code:", response_get.status_code)
print("Fetched HTML (GET request):")
print(response_get.text)
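
To see why the header makes a difference, it can help to look at what requests sends by default. The sketch below prints the default User-Agent and then uses a Session so the browser-like value is applied to every request. It only prepares a request and never sends it, so it needs no network access. BROWSER_UA is an illustrative name; its value is copied from the example above.

```python
import requests

# The User-Agent requests sends when none is supplied ("python-requests/<version>").
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Illustrative constant: the browser-like value from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
)

# Headers set on a Session are merged into every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})

# Prepare (but do not send) a request to inspect the outgoing headers.
request = requests.Request("GET", "https://www.zhihu.com")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```

A site that filters on User-Agent can easily recognize the default value as a script, which is one likely reason the request without headers is rejected.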

3. Saving the fetched content to a local file

import os
import requests

# GET request
response_get = requests.get("http://www.baidu.com")  # returns a Response object
response_get.encoding = response_get.apparent_encoding  # use the detected encoding
print("Fetched HTML (GET request):")
print(response_get.text)

# The target directory must exist before the files can be opened for writing
os.makedirs("./file", exist_ok=True)

# Save to a local file, method 1: the with statement closes the file automatically
with open("./file/zhongyan.html", "w", encoding="utf-8") as f:
    f.write(response_get.text)

# Save to a local file, method 2: open and close the file explicitly
file = open("./file/zhongyan1.html", "w", encoding="utf-8")
file.write(response_get.text)
file.close()
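
Opening a file for writing does not create missing parent directories, so open() raises FileNotFoundError when the target directory is absent. One way to handle this is a small pathlib-based helper, sketched below. The name save_html and the sample HTML are assumptions made for illustration, and the demo writes into a temporary directory rather than ./file.

```python
import tempfile
from pathlib import Path


def save_html(text: str, path: Path) -> Path:
    """Write text to path as UTF-8, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "file" / "demo.html"
    saved = save_html("<html><body>demo</body></html>", target)
    content = saved.read_text(encoding="utf-8")
    print(content)
```

In the crawler itself, the call would look like save_html(response_get.text, Path("./file/zhongyan.html")).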

4. Pretty-printing the fetched HTML

import requests
from bs4 import BeautifulSoup

# GET request
response_get = requests.get("http://www.baidu.com")  # returns a Response object
response_get.encoding = response_get.apparent_encoding  # use the detected encoding
print("Fetched HTML (GET request):")
# Parse the HTML and print it with one tag per line, indented by nesting depth
soup = BeautifulSoup(response_get.text, "html.parser")
print(soup.prettify())
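
The effect of prettify() is easier to see on a small document than on a full home page. The sketch below parses an inline HTML string, which is an illustrative stand-in for response_get.text, and also shows that the parsed tree can be queried directly.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, standing in for response_get.text.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with each tag and text node on its own line,
# indented according to nesting depth.
pretty = soup.prettify()
print(pretty)

# The parsed tree can also be queried, for example for the page title.
title = soup.title.string
print(title)
```

The single-line input comes back spread over several indented lines, which makes the structure of a fetched page much easier to read.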

5. Complete code:

import os
import requests
from bs4 import BeautifulSoup

# GET request with a browser-like User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
}
response_get = requests.get("http://www.baidu.com", headers=headers)  # returns a Response object
response_get.encoding = response_get.apparent_encoding  # use the detected encoding
print("Fetched HTML (GET request):")
# Pretty-print the fetched HTML
soup = BeautifulSoup(response_get.text, "html.parser")
# prettify() puts each tag on its own line, indented by nesting depth
print(soup.prettify())
# The target directory must exist before the files can be opened for writing
os.makedirs("./file", exist_ok=True)
# Save to a local file, method 1: the with statement closes the file automatically
with open("./file/baidu.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
# Save to a local file, method 2: open and close the file explicitly
file = open("./file/baidu1.html", "w", encoding="utf-8")
file.write(soup.prettify())
file.close()


Original article: https://www.cnblogs.com/lxj-dream/p/14429291.html