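  • Crawler (1): Scraping FishC Collection (淘贴) Information

    I dug out one of my old practice exercises.

    Starting now, I'll review one of my old crawler exercises every day and try to spot new problems and things that can be optimized.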

    # -*- coding: utf-8 -*-
    import csv
    from urllib.parse import urljoin

    import chardet
    import requests
    from lxml import etree

    START_URL = 'http://bbs.fishc.org/forum.php?mod=collection'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'


    def get_page(url):
        """Download a page and return its text, decoded with the detected encoding."""
        headers = {'User-Agent': USER_AGENT}
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        # Detect the charset from the raw bytes instead of trusting the response headers.
        r.encoding = chardet.detect(r.content)['encoding']
        return r.text


    def parse_data(page, base_url):
        """Return (rows, next_url); next_url is None on the last page."""
        html = etree.HTML(page)
        next_links = html.xpath('//a[@class="nxt"]/@href')
        # urljoin leaves an absolute link unchanged and resolves a relative one.
        next_url = urljoin(base_url, next_links[0]) if next_links else None

        result = []
        for site in html.xpath('//*[@class="xld xlda cl"]'):
            title = site.xpath('.//a[@class="xi2"]/text()')[0]
            author = site.xpath('.//p[@class="xg1"]/a/text()')[0]
            theme = site.xpath('.//strong[@class="xi2"]/text()')[0]
            # One text node holds both counters, separated by a comma.
            stats = site.xpath('./dl/dd[2]/p[2]/text()')[0]
            sub_num, com_num = (part.strip() for part in stats.split(','))
            result.append((title, author, theme, sub_num, com_num))
        return result, next_url


    def main():
        results = []
        url = START_URL
        # Follow the "next page" link until parse_data reports there is none.
        while url:
            print(url)
            page = get_page(url)
            rows, url = parse_data(page, url)
            results.extend(rows)

        headers = ['title', 'author', 'theme', 'sub_num', 'com_num']
        # newline='' stops the csv module from writing blank rows on Windows.
        with open('taotie.csv', 'w', encoding='utf-8', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerow(headers)
            f_csv.writerows(results)


    if __name__ == "__main__":
        main()
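
One possible optimization, in the spirit of revisiting this exercise: the pagination loop can be separated from the network code so that it can be checked offline. The sketch below is not part of the original script. `crawl`, `fetch`, `parse`, `max_pages` and the `example.com` pages are hypothetical names introduced for illustration, and it assumes the parser hands back the raw "next" href (possibly relative) or `None`.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch, parse, max_pages=100):
    """Yield rows from every page, following "next" links until none is left.

    fetch(url) -> page text; parse(page) -> (rows, next_href or None).
    Both callables are hypothetical stand-ins for get_page and parse_data.
    """
    url, seen = start_url, set()
    # Stop on a missing link, a link that loops back, or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        rows, next_href = parse(fetch(url))
        yield from rows
        # Resolve relative links against the current page; absolute links pass through.
        url = urljoin(url, next_href) if next_href else None


# Fake two-page site, so the loop can be exercised without network access.
FAKE_SITE = {
    'http://example.com/list?page=1': ([('a',), ('b',)], 'list?page=2'),
    'http://example.com/list?page=2': ([('c',)], None),
}

rows = list(crawl('http://example.com/list?page=1',
                  fetch=lambda url: url,            # the "page" is just its URL here
                  parse=lambda page: FAKE_SITE[page]))
print(rows)
```

Because `crawl` is a generator, rows could also be written to the CSV file as they arrive instead of being collected in one list first. The `seen` set guards against a "next" link that points back to a page already visited.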
  • Original article: https://www.cnblogs.com/Alexisbusyblog/p/9343875.html