day3

Web scraping
Python libraries
1. requests: fetches the page content
2. Beautiful Soup: parses the fetched HTML

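Request-header fields to watch:

(1) Referer (the HTTP header is spelled with a single "r"): where the request came from. Some large sites use it for hotlink protection, so a scraper needs to set it as well.

(2) User-Agent: the browser making the request. Set it, otherwise the request is likely to be treated as coming from a scraper.

(3) Cookie: remember to send it in the request headers.

import requests
from bs4 import BeautifulSoup

# Take a URL and return the page's soup object
def getSoup(url):
    # Send a browser-like header to get past the site's anti-scraping checks
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    }

    # Fetch the page content
    response = requests.get(url, headers=header)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup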
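
The three header fields noted above can be assembled in one place. The following is only a minimal sketch: the helper name build_headers, the example.com URL, and the cookie value are placeholders invented for illustration, not part of the original notes.

```python
def build_headers(referer=None, cookie=None):
    """Return request headers that resemble a normal browser visit."""
    headers = {
        # Same User-Agent string as in the notes above
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/70.0.3538.110 Safari/537.36"
        ),
    }
    # Only add the optional fields when the caller supplies them
    if referer:
        headers["Referer"] = referer
    if cookie:
        headers["Cookie"] = cookie
    return headers


# Placeholder values, for illustration only
headers = build_headers(
    referer="https://example.com/list",
    cookie="sessionid=abc123",
)
```

The resulting dict can be passed as the headers argument of requests.get in place of a hand-written one.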

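Extract the data according to the page structure
# soupHouse: a soup object for the listing page, presumably obtained from getSoup(url)
houseInfos = soupHouse.find_all('li', class_='fl oneline')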
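
One detail of the call above is worth a small experiment: when class_ is given a string containing a space, Beautiful Soup compares it with the whole class attribute, so the order of the class names matters. The sketch below uses made-up HTML (the list items are assumptions, not the real page) and the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real listing page
html = """
<ul>
  <li class="fl oneline">3 rooms</li>
  <li class="oneline fl">89 sqm</li>
  <li class="fl">south-facing</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches only when the class attribute is exactly "fl oneline"
exact = soup.find_all("li", class_="fl oneline")

# CSS selector: matches any <li> carrying both classes, in any order
both = soup.select("li.fl.oneline")

exact_texts = [li.get_text(strip=True) for li in exact]
both_texts = [li.get_text(strip=True) for li in both]
print(exact_texts)
print(both_texts)
```

If the target site sometimes emits the classes in a different order, the select form is likely the safer choice.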

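Utilities
# Cut a string at the first occurrence of a substring, scanning front to back
# (self here is simply the string being cut, not an object instance)
def splitByStr(self, character, position='front'):
    # If the substring is not present, return the string unchanged
    if self.find(character) < 0:
        return self
    # Part before the substring
    if position == 'front':
        return self[:self.index(character)]
    # Part after the substring
    else:
        return self[self.index(character) + len(character):]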
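
The same behaviour can also be had from the built-in str.partition, which splits at the first occurrence of a separator. The version below is a hypothetical equivalent written for illustration (the name split_by_str and the sample strings are assumptions), not the author's code.

```python
def split_by_str(text, sep, position="front"):
    """Return the part of text before or after the first occurrence of sep."""
    # partition returns (head, sep, tail); the middle item is "" when sep is absent.
    # Unlike the find/index version, it raises ValueError for an empty separator.
    head, found, tail = text.partition(sep)
    if not found:
        return text  # separator absent: return the string unchanged
    return head if position == "front" else tail


front = split_by_str("3 rooms | 89 sqm", " | ")
back = split_by_str("3 rooms | 89 sqm", " | ", position="back")
untouched = split_by_str("no separator here", " | ")
print(front, back, untouched, sep="\n")
```

Returning the input unchanged when the separator is missing keeps callers from having to special-case fields that lack it.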

Original article: https://www.cnblogs.com/tutuwowo/p/10867106.html
