A Simple Baidu Crawler
0x00

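　　I used to have no idea how to scrape Baidu search results with Python. The search URL carries a lot of parameters, and whenever I copied the whole URL and changed only the wd parameter, I ran into all sorts of strange problems.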

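　　Last night Master Cheng (程师傅) pointed out that many of those parameters are not actually required. Only today did I look up what each Baidu parameter means. I can't believe it never occurred to me to search for that before; I feel rather slow.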
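　　To illustrate the point about unnecessary parameters, here is a minimal sketch that builds a search URL from only the three parameters used in the script in section 0x01. The helper name `build_search_url` is hypothetical. The parameter meanings in the comments are the commonly described ones (wd as the keyword, pn as the result offset, cl=3 as ordinary web search) and should be treated as assumptions rather than official documentation.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.baidu.com/s"


def build_search_url(keyword, page_index=0):
    """Build a Baidu search URL using only wd, pn and cl.

    Assumed meanings (not taken from official documentation):
      wd - the search keyword
      pn - offset of the first result, i.e. page_index * 10
      cl - search type, 3 being ordinary web search
    """
    params = {"wd": keyword, "pn": page_index * 10, "cl": 3}
    # urlencode percent-encodes the keyword, so spaces and
    # non-ASCII characters are safe to pass in.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the URLs of the first three result pages for one keyword.
    for index in range(3):
        print(build_search_url("python", index))
```

　　Letting `urlencode` assemble the query string avoids the encoding problems that can appear when a keyword containing spaces or Chinese characters is pasted directly into the URL.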

　　So today I wrote a tiny Baidu crawler.

0x01

　　Code:
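#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# 昏鸦
import re
import sys

import requests


def get_baidu(s, page=5):
    # Each result carries a data-tools attribute holding its title and link.
    pattern = r'''data-tools='{"title":"(.*?)","url":"(.*?)"'''
    # pn is the result offset: 0, 10, 20, ... so exactly `page` pages are fetched.
    for p in range(0, page * 10, 10):
        req = "http://www.baidu.com/s?wd={}&pn={}&cl=3".format(s, p)
        res = requests.get(url=req).text
        for title, link in re.findall(pattern, res):
            # Request the link; requests follows redirects, so .url is the final address.
            url = requests.get(url=link).url
            print(title + ' ' + url)


if __name__ == '__main__':
    # Usage: <script> <keyword> [pages]
    get_baidu(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else 5)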
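　　The extraction step can be tried offline. The sketch below applies the same regular expression to a hand-written HTML fragment. The fragment and the helper name `extract_results` are made up for illustration, and Baidu's real markup may differ from it or change over time.

```python
import re

# Same pattern as in the script: capture the title and url fields
# from the JSON-like value of the data-tools attribute.
PATTERN = r'''data-tools='{"title":"(.*?)","url":"(.*?)"'''

# Hypothetical fragment imitating two search results.
SAMPLE_HTML = """
<div class="c-tools" data-tools='{"title":"Example A","url":"http://www.baidu.com/link?url=aaa"}'></div>
<div class="c-tools" data-tools='{"title":"Example B","url":"http://www.baidu.com/link?url=bbb"}'></div>
"""


def extract_results(html):
    """Return a list of (title, link) tuples found in the HTML text."""
    return re.findall(PATTERN, html)


if __name__ == "__main__":
    for title, link in extract_results(SAMPLE_HTML):
        print(title, link)
```

　　Because the pattern has two capture groups, `re.findall` returns one `(title, link)` tuple per match, which is why the script can read the title and the link from each item.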
　　

　　Result:

0x02

　　The script only scrapes the titles and URL links from the Baidu results, and crawls the first 5 pages by default.


Original article: https://www.cnblogs.com/hun-ya/p/8734193.html
