  • Web Scraping (1): Scraping FishC Collection (淘贴) Listings

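    I dug out one of my old practice scripts.

    From now on I will revisit one old scraping exercise each day, looking for problems I missed and places to optimize.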

    # -*- coding:utf-8 -*-
    import csv

    import chardet
    import requests
    from lxml import etree
    
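    def get_page(url):
        """Download a page and return its text, decoded with the detected charset."""
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
        header = {'User-Agent': user_agent}
        # A timeout keeps the crawl from hanging on a stalled connection
        r = requests.get(url, headers=header, timeout=10)
        # Fail loudly on 4xx/5xx instead of parsing an error page
        r.raise_for_status()
        # Let chardet guess the charset from the raw bytes; keep requests' guess if it cannot
        r.encoding = chardet.detect(r.content)['encoding'] or r.encoding
        return r.text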
    
    def parse_data(page):
        """Return (rows, next_url); next_url is None on the last page."""
        result = []
        html = etree.HTML(page)
        # The "next page" link carries class "nxt"; it is absent on the last page
        next_links = html.xpath('//a[@class="nxt"]/@href')
        next_url = next_links[0] if next_links else None
        if next_url:
            print(next_url)
        # Each collection entry sits in an element with class "xld xlda cl"
        sites = html.xpath('//*[@class="xld xlda cl"]')

        for site in sites:
            title = site.xpath('.//a[@class="xi2"]/text()')[0]
            author = site.xpath('.//p[@class="xg1"]/a/text()')[0]
            theme = site.xpath('.//strong[@class="xi2"]/text()')[0]
            # This text node holds the two comma-separated counts
            counts = site.xpath('./dl/dd[2]/p[2]/text()')[0]
            sub_num, com_num = (part.strip() for part in counts.split(','))
            result.append((title, author, theme, sub_num, com_num))
        return result, next_url
    
    def main():
        url = 'http://bbs.fishc.org/forum.php?mod=collection'
        results = []
        page = get_page(url)
        result, next_url = parse_data(page)
        results.extend(result)
        # Keep following the "next page" link until parse_data finds none
        while next_url:
            page = get_page(next_url)
            result, next_url = parse_data(page)
            results.extend(result)
        headers = ['title', 'author', 'theme', 'sub_num', 'com_num']
        # newline='' stops the csv module from writing blank rows on Windows
        with open('taotie.csv', 'w', encoding='utf-8', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerow(headers)
            try:
                f_csv.writerows(results)
            except UnicodeEncodeError as e:
                # Writing encodes text, so an encode error is the one to expect
                print(e)


    if __name__ == "__main__":
        main()
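
    One possible improvement to the paging loop in main(), shown only as a rough sketch. It assumes the "next page" href may be relative (not verified for this site), so each link is resolved against the current URL with urljoin, and it remembers visited URLs so a link that points back cannot loop forever. The names crawl_all_pages, fetch, parse and max_pages are hypothetical and do not appear in the original script.

```python
from urllib.parse import urljoin


def crawl_all_pages(start_url, fetch, parse, max_pages=100):
    """Follow next-page links until none remain.

    fetch(url) returns the page text; parse(page) returns (rows, next_url),
    where next_url is empty or None on the last page. Both are passed in so
    the loop can be exercised without network access.
    """
    results = []
    seen = set()  # URLs already fetched, to guard against circular links
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        rows, next_url = parse(fetch(url))
        results.extend(rows)
        # urljoin leaves absolute links unchanged and resolves relative ones
        url = urljoin(url, next_url) if next_url else None
    return results
```

    With the functions from the script above, the call would look like crawl_all_pages(url, get_page, parse_data). The max_pages cap is an arbitrary safety limit, not something the site requires.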
  • Original article: https://www.cnblogs.com/Alexisbusyblog/p/9343875.html