A small Python web scraper example

Scraping post information from Baidu Tieba

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: imcati
import re
import time

import requests


class TiebaSpider(object):
    def __init__(self, tiebaName):
        self.tiebaName = tiebaName
        self.base_url = 'https://tieba.baidu.com/f?kw=' + tiebaName + '&ie=utf-8&pn={}'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

    # Build the request URLs: 5 pages, with the pn offset advancing in steps of 50
    def get_url_list(self):
        return [self.base_url.format(i * 50) for i in range(5)]

    # Fetch one page and return the parsed post info
    def get_pageInfo(self, url):
        response = requests.get(url=url, headers=self.headers)
        return self.parse_pageInfo(response.content.decode('utf-8'))

    # Parse the HTML into a list of (relative link, title) tuples
    def parse_pageInfo(self, html):
        pattern = re.compile('<div class="t_con cleafix".*?<a rel="noreferrer" href="(.*?)" title="(.*?)" target=.*?</div>', re.S)
        return re.findall(pattern, html)

    # Append the scraped info to the file ./tieba, one post per line
    def save_info(self, info):
        with open('./tieba', 'ab') as f:
            for value_info in info:
                info_str = 'Post title: ' + value_info[1] + ' Post link: https://tieba.baidu.com' + value_info[0] + '\n'
                f.write(info_str.encode("utf-8"))

    def run(self):
        for url in self.get_url_list():
            info = self.get_pageInfo(url)
            self.save_info(info)
            time.sleep(1)  # pause between requests


if __name__ == "__main__":
    tiebaspider = TiebaSpider('python')
    tiebaspider.run()
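
To see what the parsing step returns without hitting the network, the same regular expression can be run against a small hand-written snippet. This is only a sketch: `SAMPLE_HTML`, the post IDs, the titles, and the helper name `parse_posts` are made up for illustration, and the real Tieba markup may differ or change over time.

```python
import re

# Same pattern as in TiebaSpider.parse_pageInfo; re.S lets "." match newlines too.
PATTERN = re.compile(
    r'<div class="t_con cleafix".*?<a rel="noreferrer" href="(.*?)" title="(.*?)" target=.*?</div>',
    re.S,
)

# Hypothetical markup that only imitates the shape the pattern expects.
SAMPLE_HTML = '''
<div class="t_con cleafix">
  <div class="threadlist_title">
    <a rel="noreferrer" href="/p/1000000001" title="First sample post" target="_blank">First sample post</a>
  </div>
</div>
<div class="t_con cleafix">
  <div class="threadlist_title">
    <a rel="noreferrer" href="/p/1000000002" title="Second sample post" target="_blank">Second sample post</a>
  </div>
</div>
'''


def parse_posts(html):
    """Return a list of (relative_link, title) tuples, one per matched post."""
    return PATTERN.findall(html)


posts = parse_posts(SAMPLE_HTML)
for href, title in posts:
    # Prefix the site root to turn the relative link into a full URL.
    print(title, 'https://tieba.baidu.com' + href)
```

Each tuple holds the link first and the title second, which is why `save_info` reads the title from index 1 and the link from index 0.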

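Author: imcati

Source: https://www.cnblogs.com/imcati/

Copyright of this article belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be given in a prominent place on the page.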

Original article: https://www.cnblogs.com/imcati/p/11218091.html